Computation and Language 131
☆ L$^2$M: Mutual Information Scaling Law for Long-Context Language Modeling
We rigorously establish a bipartite mutual information scaling law in natural
language that governs long-range dependencies. This scaling law, which we show
is distinct from and scales independently of the conventional two-point mutual
information, is the key to understanding long-context language modeling. Using
this scaling law, we formulate the Long-context Language Modeling (L$^2$M)
condition, which relates a model's capacity for effective long context length
modeling to the scaling of its latent state size for storing past information.
Our results are validated through experiments on both transformers and state
space models. This work establishes a theoretical foundation that guides the
development of large language models toward longer context lengths.
comment: 29 pages, 12 figures, 1 table
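As a concrete illustration of the distinction the abstract draws, the following sketch (ours, not the paper's estimator or data) computes plug-in estimates of two-point and bipartite mutual information on a toy binary Markov chain; the block size, distance, and chain parameters are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_markov(n: int, p_stay: float = 0.9) -> np.ndarray:
    """Binary Markov chain that keeps its current state with prob p_stay."""
    flips = rng.random(n) < (1 - p_stay)
    x = np.empty(n, dtype=np.int64)
    x[0] = rng.integers(2)
    for t in range(1, n):
        x[t] = x[t - 1] ^ flips[t]
    return x

def entropy(counts: np.ndarray) -> float:
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

def plugin_mi(a: np.ndarray, b: np.ndarray) -> float:
    """Plug-in estimate of I(A; B) in bits from paired integer samples."""
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1)
    return entropy(joint.sum(1)) + entropy(joint.sum(0)) - entropy(joint.ravel())

x = sample_markov(1_000_000)

# Two-point MI between single tokens at distance d: I(X_i; X_{i+d}).
d = 8
print(f"two-point MI at distance {d}: {plugin_mi(x[:-d], x[d:]):.4f} bits")

# Bipartite MI between adjacent length-L blocks: I(X_{i:i+L}; X_{i+L:i+2L}).
L = 4
codes = (np.lib.stride_tricks.sliding_window_view(x, L) * 2 ** np.arange(L)).sum(1)
print(f"bipartite MI for blocks of {L}: {plugin_mi(codes[:-L], codes[L:]):.4f} bits")
```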
☆ LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM
Sambal Shikhar, Mohammed Irfan Kurpath, Sahal Shaji Mullappilly, Jean Lahoud, Fahad Khan, Rao Muhammad Anwer, Salman Khan, Hisham Cholakkal
Recent advancements in speech-to-speech dialogue systems leverage LLMs for
multimodal interactions, yet they remain hindered by fine-tuning requirements,
high computational overhead, and text-speech misalignment. Existing
speech-enabled LLMs often degrade conversational quality by modifying the LLM,
thereby compromising its linguistic capabilities. In contrast, we propose
LLMVoX, a lightweight 30M-parameter, LLM-agnostic, autoregressive streaming TTS
system that generates high-quality speech with low latency, while fully
preserving the capabilities of the base LLM. Our approach achieves a
significantly lower Word Error Rate compared to speech-enabled LLMs, while
operating at comparable latency and UTMOS score. By decoupling speech synthesis
from LLM processing via a multi-queue token streaming system, LLMVoX supports
seamless, infinite-length dialogues. Its plug-and-play design also facilitates
extension to various tasks with different backbones. Furthermore, LLMVoX
generalizes to new languages with only dataset adaptation, attaining a low
Character Error Rate on an Arabic speech task. Additionally, we have integrated
LLMVoX with a Vision-Language Model to create an omni-model with speech, text,
and vision capabilities, without requiring additional multimodal training. Our
code base and project page are available at https://mbzuai-oryx.github.io/LLMVoX .
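The multi-queue decoupling described above can be pictured with a minimal producer/consumer sketch; everything below (the queue layout, the sentinel protocol, the stand-in synthesizer) is an assumption for illustration, not LLMVoX's implementation.

```python
# Schematic: decouple LLM decoding from speech synthesis with two queues so
# that text generation, TTS, and playback run as an overlapping pipeline.
import queue
import threading

text_q: "queue.Queue[str | None]" = queue.Queue()
audio_q: "queue.Queue[bytes | None]" = queue.Queue()

def llm_producer():
    # Stand-in for streaming LLM decoding; a real system pushes tokens as decoded.
    for token in ["Hello", " there", ",", " how", " can", " I", " help", "?"]:
        text_q.put(token)
    text_q.put(None)  # sentinel: end of stream

def tts_consumer():
    # Stand-in synthesizer: convert each text chunk into an audio chunk.
    while (token := text_q.get()) is not None:
        audio_q.put(f"<audio:{token}>".encode())  # hypothetical synthesis
    audio_q.put(None)

def playback():
    while (chunk := audio_q.get()) is not None:
        print("playing", chunk)

threads = [threading.Thread(target=f) for f in (llm_producer, tts_consumer, playback)]
for t in threads: t.start()
for t in threads: t.join()
```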
☆ Shifting Long-Context LLMs Research from Input to Output
Recent advancements in long-context Large Language Models (LLMs) have
primarily concentrated on processing extended input contexts, resulting in
significant strides in long-context comprehension. However, the equally
critical aspect of generating long-form outputs has received comparatively less
attention. This paper advocates for a paradigm shift in NLP research toward
addressing the challenges of long-output generation. Tasks such as novel
writing, long-term planning, and complex reasoning require models to understand
extensive contexts and produce coherent, contextually rich, and logically
consistent extended text. These demands highlight a critical gap in current LLM
capabilities. We underscore the importance of this under-explored domain and
call for focused efforts to develop foundational LLMs tailored for generating
high-quality, long-form outputs, which hold immense potential for real-world
applications.
comment: Preprint
☆ Enough Coin Flips Can Make LLMs Act Bayesian
Large language models (LLMs) exhibit the ability to generalize given few-shot
examples in their input prompt, an emergent capability known as in-context
learning (ICL). We investigate whether LLMs utilize ICL to perform structured
reasoning in ways that are consistent with a Bayesian framework or rely on
pattern matching. Using a controlled setting of biased coin flips, we find
that: (1) LLMs often possess biased priors, causing initial divergence in
zero-shot settings, (2) in-context evidence outweighs explicit bias
instructions, (3) LLMs broadly follow Bayesian posterior updates, with
deviations primarily due to miscalibrated priors rather than flawed updates,
and (4) attention magnitude has negligible effect on Bayesian inference. With
sufficient demonstrations of biased coin flips via ICL, LLMs update their
priors in a Bayesian manner.
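The Bayesian reference point the study compares LLMs against is ordinary Beta-Bernoulli updating; a minimal sketch (ours, not the paper's code) of how a miscalibrated prior is overridden by in-context evidence:

```python
from dataclasses import dataclass

@dataclass
class BetaPrior:
    alpha: float = 1.0  # pseudo-counts of heads
    beta: float = 1.0   # pseudo-counts of tails

    def update(self, heads: int, tails: int) -> "BetaPrior":
        return BetaPrior(self.alpha + heads, self.beta + tails)

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

# A miscalibrated prior (cf. finding 1) still converges given enough evidence
# (cf. finding 3): the posterior mean approaches the empirical rate of heads.
prior = BetaPrior(alpha=8.0, beta=2.0)        # biased prior: expects 80% heads
posterior = prior.update(heads=30, tails=70)  # in-context demonstrations
print(f"prior mean {prior.mean:.2f} -> posterior mean {posterior.mean:.3f}")
```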
☆ Full-Duplex-Bench: A Benchmark to Evaluate Full-duplex Spoken Dialogue Models on Turn-taking Capabilities
Guan-Ting Lin, Jiachen Lian, Tingle Li, Qirui Wang, Gopala Anumanchipalli, Alexander H. Liu, Hung-yi Lee
Spoken dialogue modeling introduces unique challenges beyond text-based
language modeling, demanding robust turn-taking, backchanneling, and real-time
interaction. Although most Spoken Dialogue Models (SDMs) rely on half-duplex
processing (handling speech one turn at a time), emerging full-duplex SDMs can
listen and speak simultaneously, enabling more natural and engaging
conversations. However, current evaluations of such models remain limited,
often focusing on turn-based metrics or high-level corpus analyses (e.g., turn
gaps, pauses). To address this gap, we present Full-Duplex-Bench, a new
benchmark that systematically evaluates key conversational behaviors: pause
handling, backchanneling, turn-taking, and interruption management. Our
framework uses automatic metrics for consistent and reproducible assessments of
SDMs' interactive performance. By offering an open and standardized evaluation
benchmark, we aim to advance spoken dialogue modeling and encourage the
development of more interactive and natural dialogue systems.
☆ Scaling Rich Style-Prompted Text-to-Speech Datasets
We introduce Paralinguistic Speech Captions (ParaSpeechCaps), a large-scale
dataset that annotates speech utterances with rich style captions. While rich
abstract tags (e.g. guttural, nasal, pained) have been explored in small-scale
human-annotated datasets, existing large-scale datasets only cover basic tags
(e.g. low-pitched, slow, loud). We combine off-the-shelf text and speech
embedders, classifiers and an audio language model to automatically scale rich
tag annotations for the first time. ParaSpeechCaps covers a total of 59 style
tags, including both speaker-level intrinsic tags and utterance-level
situational tags. It consists of 342 hours of human-labelled data (PSC-Base)
and 2427 hours of automatically annotated data (PSC-Scaled). We finetune
Parler-TTS, an open-source style-prompted TTS model, on ParaSpeechCaps, and
achieve improved style consistency (+7.9% Consistency MOS) and speech quality
(+15.5% Naturalness MOS) over the best performing baseline that combines
existing rich style tag datasets. We ablate several of our dataset design
choices to lay the foundation for future work in this space. Our dataset,
models and code are released at https://github.com/ajd12342/paraspeechcaps .
☆ L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning
Reasoning language models have shown an uncanny ability to improve
performance at test time by ``thinking longer'', that is, by generating longer
chain-of-thought sequences and hence using more compute. However, the length of
their chain-of-thought reasoning is not controllable, making it impossible to
allocate test-time compute to achieve a desired level of performance. We
introduce Length Controlled Policy Optimization (LCPO), a simple reinforcement
learning method that optimizes for accuracy and adherence to user-specified
length constraints. We use LCPO to train L1, a reasoning language model that
produces outputs satisfying a length constraint given in its prompt. L1's
length control allows for smoothly trading off computational cost and accuracy
on a wide range of tasks, and outperforms the state-of-the-art S1 method for
length control. Furthermore, we uncover an unexpected short chain-of-thought
capability in models trained with LCPO. For instance, our 1.5B L1 model
surpasses GPT-4o at equal reasoning lengths. Overall, LCPO enables precise
control over reasoning length, allowing for fine-grained trade-offs between
test-time compute and accuracy. We release code and models at
https://www.cmu-l3.github.io/l1
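A hedged sketch of what an LCPO-style objective could look like (the paper's exact reward may differ): reward correctness while penalizing deviation from the length target given in the prompt.

```python
def lcpo_reward(correct: bool, n_tokens: int, target_tokens: int,
                alpha: float = 0.001) -> float:
    """Accuracy term minus a length-adherence penalty; alpha trades the two off."""
    return float(correct) - alpha * abs(n_tokens - target_tokens)

print(lcpo_reward(correct=True, n_tokens=512, target_tokens=500))   # 0.988
print(lcpo_reward(correct=True, n_tokens=2048, target_tokens=500))  # -0.548
```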
☆ UIPE: Enhancing LLM Unlearning by Removing Knowledge Related to Forgetting Targets
Large Language Models (LLMs) inevitably acquire harmful information during
training on massive datasets. LLM unlearning aims to eliminate the influence of
such harmful information while maintaining the model's overall performance.
Existing unlearning methods, represented by gradient ascent-based approaches,
primarily focus on forgetting target data while overlooking the crucial impact
of logically related knowledge on the effectiveness of unlearning. In this
paper, through both theoretical and experimental analyses, we first demonstrate
that a key reason for the suboptimal unlearning performance is that models can
reconstruct the target content through reasoning with logically related
knowledge. To address this issue, we propose Unlearning Improvement via
Parameter Extrapolation (UIPE), a method that removes knowledge highly
correlated with the forgetting targets. Experimental results show that UIPE
significantly enhances the performance of various mainstream LLM unlearning
methods on the TOFU benchmark.
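The abstract does not spell out UIPE's update rule; as a loudly hypothetical illustration of what "parameter extrapolation" can mean, the sketch below moves the unlearned weights further away from weights fine-tuned on knowledge related to the forgetting target.

```python
import torch

def extrapolate(theta_unlearned: dict, theta_related: dict, lam: float = 0.5):
    """Hypothetical rule: theta' = theta_u + lam * (theta_u - theta_r),
    stepping away from the direction that encodes the correlated knowledge."""
    return {name: w + lam * (w - theta_related[name])
            for name, w in theta_unlearned.items()}

theta_u = {"w": torch.tensor([1.0, 2.0])}
theta_r = {"w": torch.tensor([1.5, 1.0])}
print(extrapolate(theta_u, theta_r))  # {'w': tensor([0.7500, 2.5000])}
```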
☆ Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases
The latest reasoning-enhanced large language models (reasoning LLMs), such as
DeepSeek-R1 and OpenAI-o3, have demonstrated remarkable success. However, the
application of such reasoning enhancements to the highly specialized medical
domain has not been rigorously evaluated, particularly with regard to assessing
not only the final output but also the quality of the underlying reasoning
process. In this study, we present MedR-Bench, a reasoning-focused
medical evaluation benchmark comprising 1,453 structured patient cases with
reasoning references mined from case reports. Our benchmark spans 13 body
systems and 10 specialty disorders, encompassing both common and rare diseases.
In our evaluation, we introduce a versatile framework consisting of three
critical clinical stages: assessment recommendation, diagnostic
decision-making, and treatment planning, comprehensively capturing the LLMs'
performance across the entire patient journey in healthcare. For metrics, we
propose a novel agentic system, Reasoning Evaluator, designed to automate and
objectively quantify free-text reasoning responses in a scalable manner from
the perspectives of efficiency, factuality, and completeness by dynamically
searching and performing cross-referencing checks. As a result, we assess five
state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and
others. Our results reveal that current LLMs can handle relatively simple
diagnostic tasks when given sufficient assessment results, achieving accuracy
generally above 85%. However, they still struggle with more complex tasks, such
as assessment recommendation and treatment planning. Their reasoning processes
are generally reliable, with factuality scores exceeding 90%, though they often
omit critical reasoning steps. Our study points to clear directions for the
further development of clinical LLMs.
☆ DIMSUM: Discourse in Mathematical Reasoning as a Supervision Module
We look at reasoning on GSM8k, a dataset of short texts presenting
primary-school math problems. We find, with Mirzadeh et al. (2024), that
current LLM progress on the dataset may not be explained by better reasoning but by
exposure to a broader pretraining data distribution. We then introduce a novel
information source for helping models with less data or inferior training
reason better: discourse structure. We show that discourse structure improves
performance for models like Llama2 13b by up to 160%. Even for models that have
most likely memorized the dataset, adding discourse structural information to
the model still improves predictions and dramatically improves large-model
performance on out-of-distribution examples.
☆ LLM-guided Plan and Retrieval: A Strategic Alignment for Interpretable User Satisfaction Estimation in Dialogue NAACL 2025
Understanding user satisfaction with conversational systems, known as User
Satisfaction Estimation (USE), is essential for assessing dialogue quality and
enhancing user experiences. However, existing methods for USE face challenges
due to limited understanding of underlying reasons for user dissatisfaction and
the high costs of annotating user intentions. To address these challenges, we
propose PRAISE (Plan and Retrieval Alignment for Interpretable Satisfaction
Estimation), an interpretable framework for effective user satisfaction
prediction. PRAISE operates through three key modules. The Strategy Planner
develops strategies, which are natural language criteria for classifying user
satisfaction. The Feature Retriever then incorporates knowledge on user
satisfaction from Large Language Models (LLMs) and retrieves relevant features
from utterances. Finally, the Score Analyzer evaluates strategy predictions and
classifies user satisfaction. Experimental results demonstrate that PRAISE
achieves state-of-the-art performance on three benchmarks for the USE task.
Beyond its superior performance, PRAISE offers additional benefits. It enhances
interpretability by providing instance-level explanations through effective
alignment of utterances with strategies. Moreover, PRAISE operates more
efficiently than existing approaches by eliminating the need for LLMs during
the inference phase.
comment: Accepted by NAACL 2025
☆ An Information-theoretic Multi-task Representation Learning Framework for Natural Language Understanding AAAI 2025
This paper proposes a new principled multi-task representation learning
framework (InfoMTL) to extract noise-invariant sufficient representations for
all tasks. It ensures sufficiency of shared representations for all tasks and
mitigates the negative effect of redundant features, which can enhance language
understanding of pre-trained language models (PLMs) under the multi-task
paradigm. Firstly, a shared information maximization principle is proposed to
learn more sufficient shared representations for all target tasks. It can avoid
the insufficiency issue arising from representation compression in the
multi-task paradigm. Secondly, a task-specific information minimization
principle is designed to mitigate the negative effect of potential redundant
features in the input for each task. It can compress task-irrelevant redundant
information and preserve necessary information relevant to the target for
multi-task prediction. Experiments on six classification benchmarks show that
our method outperforms 12 comparative multi-task methods under the same
multi-task settings, especially in data-constrained and noisy scenarios.
Extensive experiments demonstrate that the learned representations are more
sufficient, data-efficient, and robust.
comment: 11 pages, accepted to AAAI 2025 (main conference), the code is
available at https://github.com/zerohd4869/InfoMTL
☆ Implicit Cross-Lingual Rewarding for Efficient Multilingual Preference Alignment
Direct Preference Optimization (DPO) has become a prominent method for
aligning Large Language Models (LLMs) with human preferences. While DPO has
enabled significant progress in aligning English LLMs, multilingual preference
alignment is hampered by data scarcity. To address this, we propose a novel
approach that $\textit{captures}$ learned preferences from well-aligned English
models by implicit rewards and $\textit{transfers}$ them to other languages
through iterative training. Specifically, we derive an implicit reward model
from the logits of an English DPO-aligned model and its corresponding reference
model. This reward model is then leveraged to annotate preference relations in
cross-lingual instruction-following pairs, using English instructions to
evaluate multilingual responses. The annotated data is subsequently used for
multilingual DPO fine-tuning, facilitating preference knowledge transfer from
English to other languages. Fine-tuning Llama3 for two iterations resulted in a
12.72% average improvement in Win Rate and a 5.97% increase in Length Control
Win Rate across all training languages on the X-AlpacaEval leaderboard. Our
findings demonstrate that leveraging existing English-aligned models can enable
efficient and effective multilingual preference alignment, significantly
reducing the need for extensive multilingual preference data. The code is
available at https://github.com/ZNLP/Implicit-Cross-Lingual-Rewarding
comment: Work in progress
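The implicit reward the approach builds on is the standard DPO identity: beta times the log-probability ratio between the aligned policy and its reference model. A minimal sketch with toy tensors (model loading and tokenization omitted; shapes are placeholders):

```python
import torch
import torch.nn.functional as F

def sequence_logprob(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Sum of per-token log-probs of `labels` under `logits` (batch, seq, vocab)."""
    logp = F.log_softmax(logits, dim=-1)
    return logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1).sum(-1)

def implicit_reward(policy_logits, ref_logits, labels, beta: float = 0.1):
    # r(x, y) = beta * log( pi_dpo(y|x) / pi_ref(y|x) ), summed over tokens.
    return beta * (sequence_logprob(policy_logits, labels)
                   - sequence_logprob(ref_logits, labels))

# Toy shapes: 2 candidate responses, seq=5, vocab=100. In the paper's setting,
# these logits would come from an English DPO-aligned model and its reference,
# scoring multilingual responses to annotate cross-lingual preference pairs.
policy = torch.randn(2, 5, 100)
ref = torch.randn(2, 5, 100)
labels = torch.randint(0, 100, (2, 5))
print(implicit_reward(policy, ref, labels))
```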
☆ IFIR: A Comprehensive Benchmark for Evaluating Instruction-Following in Expert-Domain Information Retrieval NAACL 2025
We introduce IFIR, the first comprehensive benchmark designed to evaluate
instruction-following information retrieval (IR) in expert domains. IFIR
includes 2,426 high-quality examples and covers eight subsets across four
specialized domains: finance, law, healthcare, and science literature. Each
subset addresses one or more domain-specific retrieval tasks, replicating
real-world scenarios where customized instructions are critical. IFIR enables a
detailed analysis of instruction-following retrieval capabilities by
incorporating instructions at different levels of complexity. We also propose a
novel LLM-based evaluation method to provide a more precise and reliable
assessment of model performance in following instructions. Through extensive
experiments on 15 frontier retrieval models, including those based on LLMs, our
results reveal that current models face significant challenges in effectively
following complex, domain-specific instructions. We further provide in-depth
analyses to highlight these limitations, offering valuable insights to guide
future advancements in retriever development.
comment: NAACL 2025 Main
☆ Mark Your LLM: Detecting the Misuse of Open-Source Large Language Models via Watermarking ICLR 2025
As open-source large language models (LLMs) like Llama3 become more capable,
it is crucial to develop watermarking techniques to detect their potential
misuse. Existing watermarking methods either add watermarks during LLM
inference, which is unsuitable for open-source LLMs, or primarily target
classification LLMs rather than recent generative LLMs. Adapting these
watermarks to open-source LLMs for misuse detection remains an open challenge.
This work defines two misuse scenarios for open-source LLMs: intellectual
property (IP) violation and LLM Usage Violation. Then, we explore the
application of inference-time watermark distillation and backdoor watermarking
in these contexts. We propose comprehensive evaluation methods to assess the
impact of various real-world further fine-tuning scenarios on watermarks and
the effect of these watermarks on LLM performance. Our experiments reveal that
backdoor watermarking could effectively detect IP Violation, while
inference-time watermark distillation is applicable in both scenarios but less
robust to further fine-tuning and has a more significant impact on LLM
performance compared to backdoor watermarking. Exploring more advanced
watermarking methods for open-source LLMs to detect their misuse should be an
important future direction.
comment: Accepted by the 1st Workshop on GenAI Watermarking, collocated with
ICLR 2025
☆ SurveyForge: On the Outline Heuristics, Memory-Driven Generation, and Multi-dimensional Evaluation for Automated Survey Writing
Survey paper plays a crucial role in scientific research, especially given
the rapid growth of research publications. Recently, researchers have begun
using LLMs to automate survey generation for better efficiency. However, the
quality gap between LLM-generated surveys and those written by humans remains
significant, particularly in terms of outline quality and citation accuracy. To
close these gaps, we introduce SurveyForge, which first generates the outline
by analyzing the logical structure of human-written outlines and referring to
the retrieved domain-related articles. Subsequently, leveraging high-quality
papers retrieved from memory by our scholar navigation agent, SurveyForge can
automatically generate and refine the content of the generated article.
Moreover, to achieve a comprehensive evaluation, we construct SurveyBench,
which includes 100 human-written survey papers for win-rate comparison and
assesses AI-generated survey papers across three dimensions: reference,
outline, and content quality. Experiments demonstrate that SurveyForge can
outperform previous works such as AutoSurvey.
comment: Code and dataset are available for downloading at:
https://github.com/Alpha-Innovator/SurveyForge 22 pages, 10 figures
☆ START: Self-taught Reasoner with Tools
Chengpeng Li, Mingfeng Xue, Zhenru Zhang, Jiaxi Yang, Beichen Zhang, Xiang Wang, Bowen Yu, Binyuan Hui, Junyang Lin, Dayiheng Liu
Large reasoning models (LRMs) like OpenAI-o1 and DeepSeek-R1 have
demonstrated remarkable capabilities in complex reasoning tasks through the
utilization of long Chain-of-thought (CoT). However, these models often suffer
from hallucinations and inefficiencies due to their reliance solely on internal
reasoning processes. In this paper, we introduce START (Self-Taught Reasoner
with Tools), a novel tool-integrated long CoT reasoning LLM that significantly
enhances reasoning capabilities by leveraging external tools. Through code
execution, START is capable of performing complex computations, self-checking,
exploring diverse methods, and self-debugging, thereby addressing the
limitations of LRMs. The core innovation of START lies in its self-learning
framework, which comprises two key techniques: 1) Hint-infer: We demonstrate
that inserting artificially designed hints (e.g., ``Wait, maybe using Python
here is a good idea.'') during the inference process of an LRM effectively
stimulates its ability to utilize external tools without the need for any
demonstration data. Hint-infer can also serve as a simple and effective
sequential test-time scaling method; 2) Hint Rejection Sampling Fine-Tuning
(Hint-RFT): Hint-RFT combines Hint-infer and RFT by scoring, filtering, and
modifying the reasoning trajectories with tool invocation generated by an LRM
via Hint-infer, followed by fine-tuning the LRM. Through this framework, we
have fine-tuned the QwQ-32B model to achieve START. On PhD-level science QA
(GPQA), competition-level math benchmarks (AMC23, AIME24, AIME25), and the
competition-level code benchmark (LiveCodeBench), START achieves accuracy rates
of 63.6%, 95.0%, 66.7%, 47.1%, and 47.3%, respectively. It significantly
outperforms the base QwQ-32B and achieves performance comparable to the
state-of-the-art open-weight model R1-Distill-Qwen-32B and the proprietary
model o1-Preview.
comment: 38 pages, 5 figures and 6 tables
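A schematic of Hint-infer as described in the abstract (the decode loop and hint-placement policy below are our assumptions, not the paper's code):

```python
HINT = "Wait, maybe using Python here is a good idea."

def hint_infer(generate_step, prompt: str, hint_positions: set[int],
               max_steps: int = 64) -> str:
    """`generate_step(text) -> str` is a stand-in for one LRM decoding step
    returning the next chunk of the chain of thought."""
    text = prompt
    for step in range(max_steps):
        if step in hint_positions:
            text += "\n" + HINT + "\n"  # injected hint steers toward tool use
        chunk = generate_step(text)
        text += chunk
        if chunk.endswith("</answer>"):  # hypothetical stop marker
            break
    return text
```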
☆ SynGraph: A Dynamic Graph-LLM Synthesis Framework for Sparse Streaming User Sentiment Modeling
User reviews on e-commerce platforms exhibit dynamic sentiment patterns
driven by temporal and contextual factors. Traditional sentiment analysis
methods focus on static reviews, failing to capture the evolving temporal
relationship between user sentiment rating and textual content. Sentiment
analysis on streaming reviews addresses this limitation by modeling and
predicting the temporal evolution of user sentiments. However, it suffers from
data sparsity, manifesting in temporal, spatial, and combined forms. In this
paper, we introduce SynGraph, a novel framework designed to address data
sparsity in sentiment analysis on streaming reviews. SynGraph alleviates data
sparsity by categorizing users into mid-tail, long-tail, and extreme scenarios
and incorporating LLM-augmented enhancements within a dynamic graph-based
structure. Experiments on real-world datasets demonstrate its effectiveness in
addressing sparsity and improving sentiment modeling in streaming reviews.
comment: 18 pages, 17 figures
☆ Better Process Supervision with Bi-directional Rewarding Signals
Wenxiang Chen, Wei He, Zhiheng Xi, Honglin Guo, Boyang Hong, Jiazheng Zhang, Rui Zheng, Nijun Li, Tao Gui, Yun Li, Qi Zhang, Xuanjing Huang
Process supervision, i.e., evaluating each step, is critical for complex
large language model (LLM) reasoning and test-time searching with increased
inference compute. Existing approaches, represented by process reward models
(PRMs), primarily focus on rewarding signals up to the current step, exhibiting
a one-directional nature and lacking a mechanism to model the distance to the
final target. To address this problem, we draw inspiration from the A*
algorithm, which states that an effective supervisory signal should
simultaneously consider the incurred cost and the estimated cost for reaching
the target. Building on this key insight, we introduce BiRM, a novel process
supervision model that not only evaluates the correctness of previous steps but
also models the probability of future success. We conduct extensive experiments
on mathematical reasoning tasks and demonstrate that BiRM provides more precise
evaluations of LLM reasoning steps, achieving an improvement of 3.1% on
Gaokao2023 over PRM under the Best-of-N sampling method. Besides, in
search-based strategies, BiRM provides more comprehensive guidance and
outperforms ORM by 5.0% and PRM by 3.8% on MATH-500.
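The A*-style intuition, f(n) = g(n) + h(n), can be written down directly; the combination below is an illustrative assumption, not BiRM's exact parameterization:

```python
def birm_score(correctness_so_far: float, p_future_success: float,
               w: float = 0.5) -> float:
    """Analogue of f(n) = g(n) + h(n): quality incurred so far plus an
    estimate of the probability of reaching a correct final answer."""
    return (1 - w) * correctness_so_far + w * p_future_success

# A prefix with flawless steps on a dead-end trajectory can rank below a
# slightly noisier prefix that is likely to reach the target.
print(birm_score(1.00, 0.10))  # 0.55
print(birm_score(0.85, 0.70))  # 0.775
```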
☆ HalluCounter: Reference-free LLM Hallucination Detection in the Wild!
Ashok Urlana, Gopichand Kanumolu, Charaka Vinayak Kumar, Bala Mallikarjunarao Garlapati, Rahul Mishra
Response-consistency-based, reference-free hallucination detection (RFHD)
methods do not depend on internal model states, such as generation
probabilities or gradients, which grey-box methods typically rely on but which
are inaccessible in closed-source LLMs. However, their inability to capture
query-response alignment patterns often results in lower detection accuracy.
Additionally, the lack of large-scale benchmark datasets spanning diverse
domains remains a challenge, as most existing datasets are limited in size and
scope. To this end, we propose HalluCounter, a novel reference-free
hallucination detection method that utilizes both response-response and
query-response consistency and alignment patterns. This enables the training of
a classifier that detects hallucinations and provides a confidence score and an
optimal response for user queries. Furthermore, we introduce HalluCounterEval,
a benchmark dataset comprising both synthetically generated and human-curated
samples across multiple domains. Our method outperforms state-of-the-art
approaches by a significant margin, achieving over 90\% average confidence in
hallucination detection across datasets.
comment: 30 pages, 4 figures
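A simplified sketch of consistency-based scoring of this kind (the similarity function and the aggregation are placeholder assumptions, not HalluCounter's actual features): sample several responses and combine response-response with query-response agreement; low combined consistency flags a likely hallucination.

```python
from itertools import combinations

def consistency_score(similarity, query: str, responses: list[str]) -> float:
    """`similarity(a, b) -> float in [0, 1]` is a placeholder, e.g. an
    NLI- or embedding-based scorer."""
    rr = [similarity(a, b) for a, b in combinations(responses, 2)]
    qr = [similarity(query, r) for r in responses]
    return 0.5 * sum(rr) / len(rr) + 0.5 * sum(qr) / len(qr)

# Toy usage with a trivial lexical-overlap stand-in for `similarity`.
jaccard = lambda a, b: (len(set(a.split()) & set(b.split()))
                        / len(set(a.split()) | set(b.split())))
print(consistency_score(jaccard, "capital of France?",
                        ["Paris is the capital", "It is Paris", "Lyon maybe"]))
```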
☆ Towards Data-Efficient Language Models: A Child-Inspired Approach to Language Learning
In this work, we explain the approach we employed in the BabyLM Challenge,
which uses various methods to train language models (LMs) on significantly less
data than traditional large language models (LLMs), inspired by how human
children learn. While a human child is exposed to far less linguistic
input than an LLM, they still achieve remarkable language understanding and
generation abilities. To this end, we develop a model trained on a curated
dataset consisting of 10 million words, primarily sourced from child-directed
transcripts. The 2024 BabyLM Challenge initial dataset of 10M words is filtered
to 8.5M. Next, it is supplemented with a randomly selected subset of the TVR
dataset consisting of 1.5M words of television dialogues. The latter dataset
ensures that, like children, the model is also exposed to language through
media. Furthermore, we reduce the vocabulary size to 32,000 tokens, aligning it
with the limited vocabulary of children in the early stages of language
acquisition. We use curriculum learning and are able to match the baseline on
certain benchmarks while surpassing it on others. Additionally,
incorporating common LLM training datasets, such as MADLAD-400, degrades
performance. These findings underscore the importance of dataset selection,
vocabulary scaling, and curriculum learning in creating more data-efficient
language models that better mimic human learning processes.
comment: 5 pages
☆ The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation
Recent advancements in text-to-video (T2V) generation have been driven by two
competing paradigms: autoregressive language models and diffusion models.
However, each paradigm has intrinsic limitations: language models struggle with
visual quality and error accumulation, while diffusion models lack semantic
understanding and causal modeling. In this work, we propose LanDiff, a hybrid
framework that synergizes the strengths of both paradigms through
coarse-to-fine generation. Our architecture introduces three key innovations:
(1) a semantic tokenizer that compresses 3D visual features into compact 1D
discrete representations through efficient semantic compression, achieving a
$\sim$14,000$\times$ compression ratio; (2) a language model that generates
semantic tokens with high-level semantic relationships; (3) a streaming
diffusion model that refines coarse semantics into high-fidelity videos.
Experiments show that LanDiff, a 5B model, achieves a score of 85.43 on the
VBench T2V benchmark, surpassing the state-of-the-art open-source models
Hunyuan Video (13B) and other commercial models such as Sora, Keling, and
Hailuo. Furthermore, our model also achieves state-of-the-art performance in
long video generation, surpassing other open-source models in this field. Our
demo can be viewed at https://landiff.github.io/.
☆ HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization
Transformers have become the de facto architecture for a wide range of
machine learning tasks, particularly in large language models (LLMs). Despite
their remarkable performance, challenges remain in training deep transformer
networks, especially regarding the location of layer normalization. While
Pre-Norm structures facilitate easier training due to their more prominent
identity path, they often yield suboptimal performance compared to Post-Norm.
In this paper, we propose $\textbf{HybridNorm}$, a straightforward yet
effective hybrid normalization strategy that integrates the advantages of both
Pre-Norm and Post-Norm approaches. Specifically, HybridNorm employs QKV
normalization within the attention mechanism and Post-Norm in the feed-forward
network (FFN) of each transformer block. This design not only stabilizes
training but also enhances performance, particularly in the context of LLMs.
Comprehensive experiments in both dense and sparse architectures show that
HybridNorm consistently outperforms both Pre-Norm and Post-Norm approaches,
achieving state-of-the-art results across various benchmarks. These findings
highlight the potential of HybridNorm as a more stable and effective technique
for improving the training and performance of deep transformer models. Code is
available at https://github.com/BryceZhuo/HybridNorm.
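A hedged PyTorch sketch of the block structure described above, with QKV normalization inside attention and Post-Norm around the FFN (the norm type and exact placement are simplified assumptions):

```python
import torch
import torch.nn as nn

class HybridNormBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.q_norm = nn.LayerNorm(d_model)
        self.k_norm = nn.LayerNorm(d_model)
        self.v_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.post_norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # QKV-Norm attention: normalize the Q/K/V inputs while keeping the
        # residual path itself as an identity (Pre-Norm-like stability).
        q, k, v = self.q_norm(x), self.k_norm(x), self.v_norm(x)
        x = x + self.attn(q, k, v, need_weights=False)[0]
        # Post-Norm FFN: normalization applied after the residual addition.
        return self.post_norm(x + self.ffn(x))

block = HybridNormBlock(d_model=64, n_heads=4)
print(block(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```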
☆ Compositional Causal Reasoning Evaluation in Language Models
Causal reasoning and compositional reasoning are two core aspirations in
generative AI. Measuring the extent of these behaviors requires principled
evaluation methods. We explore a unified perspective that considers both
behaviors simultaneously, termed compositional causal reasoning (CCR): the
ability to infer how causal measures compose and, equivalently, how causal
quantities propagate through graphs. We instantiate a framework for the
systematic evaluation of CCR for the average treatment effect and the
probability of necessity and sufficiency. As proof of concept, we demonstrate
the design of CCR tasks for language models in the LLama, Phi, and GPT
families. On a math word problem, our framework revealed a range of
taxonomically distinct error patterns. Additionally, CCR errors increased with
the complexity of causal paths for all models except o1.
☆ Compositional Translation: A Novel LLM-based Approach for Low-resource Machine Translation
The ability of generative large language models (LLMs) to perform in-context
learning has given rise to a large body of research into how best to prompt
models for various natural language processing tasks. Machine Translation (MT)
has been shown to benefit from in-context examples, in particular when they are
semantically similar to the sentence to translate. In this paper, we propose a
new LLM-based translation paradigm, compositional translation, to replace naive
few-shot MT with similarity-based demonstrations. An LLM is used to decompose a
sentence into simpler phrases, and then to translate each phrase with the help
of retrieved demonstrations. Finally, the LLM is prompted to translate the
initial sentence with the help of the self-generated phrase-translation pairs.
Our intuition is that this approach should improve translation because these
shorter phrases should be intrinsically easier to translate and easier to match
with relevant examples. This is especially beneficial in low-resource
scenarios, and more generally whenever the selection pool is small or out of
domain. We show that compositional translation boosts LLM translation
performance on a wide range of popular MT benchmarks, including FLORES 200,
NTREX 128 and TICO-19. Code and outputs are available at
https://github.com/ArmelRandy/compositional-translation
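The three-step recipe can be summarized as a prompting pipeline; `llm` and `retrieve` below are placeholders and the prompt wording is an assumption, not the paper's prompts:

```python
def compositional_translate(llm, retrieve, sentence: str, tgt_lang: str) -> str:
    # 1) Decompose the source sentence into simpler phrases.
    phrases = llm(f"Split into simple phrases, one per line:\n{sentence}").splitlines()

    # 2) Translate each phrase, helped by similar retrieved demonstrations.
    pairs = []
    for phrase in phrases:
        demos = "\n".join(f"{s} => {t}" for s, t in retrieve(phrase))
        pairs.append((phrase, llm(f"{demos}\nTranslate to {tgt_lang}: {phrase}")))

    # 3) Translate the full sentence, conditioning on self-generated pairs.
    hints = "\n".join(f"{s} => {t}" for s, t in pairs)
    return llm(f"Phrase translations:\n{hints}\nTranslate to {tgt_lang}: {sentence}")
```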
☆ An Empirical Study on Eliciting and Improving R1-like Reasoning Models
Zhipeng Chen, Yingqian Min, Beichen Zhang, Jie Chen, Jinhao Jiang, Daixuan Cheng, Wayne Xin Zhao, Zheng Liu, Xu Miao, Yang Lu, Lei Fang, Zhongyuan Wang, Ji-Rong Wen
In this report, we present the third technical report on the development of
slow-thinking models as part of the STILL project. As the technical pathway
becomes clearer, scaling RL training has become a central technique for
implementing such reasoning models. We systematically experiment with and
document the effects of various factors influencing RL training, conducting
experiments on both base models and fine-tuned models. Specifically, we
demonstrate that our RL training approach consistently improves the Qwen2.5-32B
base models, enhancing both response length and test accuracy. Furthermore, we
show that even when a model like DeepSeek-R1-Distill-Qwen-1.5B has already
achieved a high performance level, it can be further refined through RL
training, reaching an accuracy of 39.33% on AIME 2024. Beyond RL training, we
also explore the use of tool manipulation, finding that it significantly boosts
the reasoning performance of large reasoning models. This approach achieves a
remarkable accuracy of 86.67% with greedy search on AIME 2024, underscoring its
effectiveness in enhancing model capabilities. We release our resources at the
STILL project website: https://github.com/RUCAIBox/Slow_Thinking_with_LLMs.
comment: Technical Report on Slow Thinking with LLMs: Part III
☆ Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model
Wenke Huang, Jian Liang, Xianda Guo, Yiyang Fang, Guancheng Wan, Xuankun Rong, Chi Wen, Zekun Shi, Qingyun Li, Didi Zhu, Yanbiao Ma, Ke Liang, Bin Yang, He Li, Jiawei Shao, Mang Ye, Bo Du
Multi-modal Large Language Models (MLLMs) integrate visual and linguistic
reasoning to address complex tasks such as image captioning and visual question
answering. While MLLMs demonstrate remarkable versatility, they exhibit
limited performance on specialized applications. Tuning MLLMs for downstream
tasks, however, encounters two key challenges: Task-Expert Specialization, where
distribution shifts between pre-training and target datasets constrain target
performance, and Open-World Stabilization, where catastrophic forgetting erases
the model's general knowledge. In this work, we systematically review recent
advancements in MLLM tuning methodologies, classifying them into three
paradigms: (I) Selective Tuning, (II) Additive Tuning, and (III)
Reparameterization Tuning. Furthermore, we benchmark these tuning strategies
across popular MLLM architectures and diverse downstream tasks to establish
standardized evaluation analysis and systematic tuning principles. Finally, we
highlight several open challenges in this domain and propose future research
directions. To facilitate ongoing progress in this rapidly evolving field, we
provide a public repository that continuously tracks developments:
https://github.com/WenkeHuang/Awesome-MLLM-Tuning.
☆ Large Language Models in Bioinformatics: A Survey
Large Language Models (LLMs) are revolutionizing bioinformatics, enabling
advanced analysis of DNA, RNA, proteins, and single-cell data. This survey
provides a systematic review of recent advancements, focusing on genomic
sequence modeling, RNA structure prediction, protein function inference, and
single-cell transcriptomics. Meanwhile, we also discuss several key challenges,
including data scarcity, computational complexity, and cross-omics integration,
and explore future directions such as multimodal learning, hybrid AI models,
and clinical applications. By offering a comprehensive perspective, this paper
underscores the transformative potential of LLMs in driving innovations in
bioinformatics and precision medicine.
☆ Generalized Interpolating Discrete Diffusion
While state-of-the-art language models achieve impressive results through
next-token prediction, they have inherent limitations such as the inability to
revise already generated tokens. This has prompted exploration of alternative
approaches such as discrete diffusion. However, masked diffusion, which has
emerged as a popular choice due to its simplicity and effectiveness,
reintroduces this inability to revise words. To overcome this, we generalize
masked diffusion and derive the theoretical backbone of a family of general
interpolating discrete diffusion (GIDD) processes offering greater flexibility
in the design of the noising processes. Leveraging a novel diffusion ELBO, we
achieve compute-matched state-of-the-art performance in diffusion language
modeling. Exploiting GIDD's flexibility, we explore a hybrid approach combining
masking and uniform noise, leading to improved sample quality and unlocking the
ability for the model to correct its own mistakes, an area where autoregressive
models notoriously have struggled. Our code and models are open-source:
https://github.com/dvruette/gidd/
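One way to picture a masking/uniform hybrid forward process (an illustrative special case with ad hoc schedules; GIDD's interpolation family is more general):

```python
import torch

def hybrid_corrupt(tokens: torch.Tensor, t: float, u: float,
                   vocab_size: int, mask_id: int) -> torch.Tensor:
    """tokens: (batch, seq) ints; t in [0, 1] is the corruption level.
    Each corrupted position becomes a uniformly random token with prob u,
    otherwise the [MASK] token."""
    corrupt = torch.rand_like(tokens, dtype=torch.float) < t
    use_uniform = torch.rand_like(tokens, dtype=torch.float) < u
    noise = torch.randint_like(tokens, vocab_size)
    out = tokens.clone()
    out[corrupt & use_uniform] = noise[corrupt & use_uniform]
    out[corrupt & ~use_uniform] = mask_id
    return out

x = torch.randint(0, 100, (1, 12))
print(hybrid_corrupt(x, t=0.5, u=0.2, vocab_size=100, mask_id=100))
```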
☆ Guiding LLMs to Generate High-Fidelity and High-Quality Counterfactual Explanations for Text Classification
The need for interpretability in deep learning has driven interest in
counterfactual explanations, which identify minimal changes to an instance that
change a model's prediction. Current counterfactual (CF) generation methods
require task-specific fine-tuning and produce low-quality text. Large Language
Models (LLMs), though effective for high-quality text generation, struggle with
label-flipping counterfactuals (i.e., counterfactuals that change the
prediction) without fine-tuning. We introduce two simple classifier-guided
approaches to support counterfactual generation by LLMs, eliminating the need
for fine-tuning while preserving the strengths of LLMs. Despite their
simplicity, our methods outperform state-of-the-art counterfactual generation
methods and are effective across different LLMs, highlighting the benefits of
guiding counterfactual generation by LLMs with classifier information. We
further show that data augmentation by our generated CFs can improve a
classifier's robustness. Our analysis reveals a critical issue in
counterfactual generation by LLMs: LLMs rely on parametric knowledge rather
than faithfully following the classifier.
☆ Quantifying patterns of punctuation in modern Chinese prose
Recent research shows that punctuation patterns in texts exhibit universal
features across languages. Analysis of Western classical literature reveals
that the distribution of spaces between punctuation marks aligns with a
discrete Weibull distribution, typically used in survival analysis. By
extending this analysis to Chinese literature represented here by three notable
contemporary works, it is shown that Zipf's law applies to Chinese texts
similarly to Western texts, where punctuation patterns also improve adherence
to the law. Additionally, the distance distribution between punctuation marks
in Chinese texts follows the Weibull model, though larger spacing is less
frequent than in English translations. Sentence-ending punctuation,
representing sentence length, diverges more from this pattern, reflecting
greater flexibility in sentence length. This variability supports the formation
of complex, multifractal sentence structures, particularly evident in Gao
Xingjian's "Soul Mountain". These findings demonstrate that both Chinese and
Western texts share universal punctuation and word distribution patterns,
underscoring their broad applicability across languages.
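The discrete (type-I) Weibull model referenced above has pmf P(K = k) = q^(k^beta) - q^((k+1)^beta). A sketch of fitting it by maximum likelihood, with synthetic distances standing in for a real corpus:

```python
import numpy as np
from scipy.optimize import minimize

def discrete_weibull_pmf(k, q, beta):
    # Type-I discrete Weibull: P(K = k) = q^(k^beta) - q^((k+1)^beta)
    k = np.asarray(k, dtype=float)
    return q ** (k ** beta) - q ** ((k + 1) ** beta)

def neg_log_lik(params, data):
    q, beta = params
    p = discrete_weibull_pmf(data, q, beta)
    return -np.log(np.clip(p, 1e-12, None)).sum()

# Stand-in sample of word counts between consecutive punctuation marks.
rng = np.random.default_rng(1)
distances = np.floor(rng.weibull(1.3, size=5000) * 8).astype(int)

fit = minimize(neg_log_lik, x0=[0.9, 1.0], args=(distances,),
               bounds=[(1e-6, 1 - 1e-6), (0.05, 10.0)])
q_hat, beta_hat = fit.x
print(f"q = {q_hat:.3f}, beta = {beta_hat:.3f}")
```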
☆ A Dataset for Analysing News Framing in Chinese Media
Framing is an essential device in news reporting, allowing the writer to
influence public perceptions of current affairs. While there are existing
automatic news framing detection datasets in various languages, none of them
focus on news framing in the Chinese language which has complex character
meanings and unique linguistic features. This study introduces the first
Chinese News Framing dataset, to be used as either a stand-alone dataset or a
supplementary resource to the SemEval-2023 task 3 dataset. We detail its
creation and we run baseline experiments to highlight the need for such a
dataset and create benchmarks for future research, providing results obtained
through fine-tuning XLM-RoBERTa-Base and using GPT-4o in the zero-shot setting.
We find that GPT-4o performs significantly worse than fine-tuned XLM-RoBERTa
across all languages. For the Chinese language, we obtain an F1-micro (the
performance metric for SemEval task 3, subtask 2) score of 0.719 using only
samples from our Chinese News Framing dataset and a score of 0.753 when we
augment the SemEval dataset with Chinese news framing samples. With positive
news frame detection results, this dataset is a valuable resource for detecting
news frames in the Chinese language and is a valuable supplement to the
SemEval-2023 task 3 dataset.
☆ Revisiting the Othello World Model Hypothesis ICLR
Li et al. (2023) used the Othello board game as a test case for the ability
of GPT-2 to induce world models, and were followed up by Nanda et al. (2023b).
We briefly discuss the original experiments, expanding them to include more
language models with more comprehensive probing. Specifically, we analyze
sequences of Othello board states and train the model to predict the next move
based on previous moves. We evaluate seven language models (GPT-2, T5, Bart,
Flan-T5, Mistral, LLaMA-2, and Qwen2.5) on the Othello task and conclude that
these models not only learn to play Othello, but also induce the Othello board
layout. We find that all models achieve up to 99% accuracy in unsupervised
grounding and exhibit high similarity in the board features they learned. This
provides considerably stronger evidence for the Othello World Model Hypothesis
than previous works.
comment: ICLR World Models Workshop
☆ Can Large Language Models Predict Antimicrobial Resistance Gene?
This study demonstrates that generative large language models can be utilized
in a more flexible manner for DNA sequence analysis and classification tasks
compared to traditional transformer encoder-based models. While recent
encoder-based models such as DNABERT and Nucleotide Transformer have shown
significant performance in DNA sequence classification, transformer
decoder-based generative models have not yet been extensively explored in this
field. This study evaluates how effectively generative Large Language Models
handle DNA sequences with various labels and analyzes performance changes when
additional textual information is provided. Experiments were conducted on
antimicrobial resistance genes, and the results show that generative Large
Language Models can offer comparable or potentially better predictions,
demonstrating flexibility and accuracy when incorporating both sequence and
textual information. The code and data used in this work are available at the
following GitHub repository: https://github.com/biocomgit/llm4dna.
☆ Comparative Study of Zero-Shot Cross-Lingual Transfer for Bodo POS and NER Tagging Using Gemini 2.0 Flash Thinking Experimental Model
Named Entity Recognition (NER) and Part-of-Speech (POS) tagging are critical
tasks for Natural Language Processing (NLP), yet their availability for
low-resource languages (LRLs) like Bodo remains limited. This article presents
a comparative empirical study investigating the effectiveness of Google's
Gemini 2.0 Flash Thinking Experimental model for zero-shot cross-lingual transfer
of POS and NER tagging to Bodo. We explore two distinct methodologies: (1)
direct translation of English sentences to Bodo followed by tag transfer, and
(2) prompt-based tag transfer on parallel English-Bodo sentence pairs. Both
methods leverage the machine translation and cross-lingual understanding
capabilities of Gemini 2.0 Flash Thinking Experimental to project English POS and
NER annotations onto Bodo text in CONLL-2003 format. Our findings reveal the
capabilities and limitations of each approach, demonstrating that while both
methods show promise for bootstrapping Bodo NLP, prompt-based transfer exhibits
superior performance, particularly for NER. We provide a detailed analysis of
the results, highlighting the impact of translation quality, grammatical
divergences, and the inherent challenges of zero-shot cross-lingual transfer.
The article concludes by discussing future research directions, emphasizing the
need for hybrid approaches, few-shot fine-tuning, and the development of
dedicated Bodo NLP resources to achieve high-accuracy POS and NER tagging for
this low-resource language.
comment: Submitted to SpringerNature MTAP journal. This article has not been
reviewed yet. Submitting for public review!
☆ TableLoRA: Low-rank Adaptation on Table Structure Understanding for Large Language Models
Tabular data are crucial in many fields and their understanding by large
language models (LLMs) under high parameter efficiency paradigm is important.
However, directly applying parameter-efficient fine-tuning (PEFT) techniques to
tabular tasks presents significant challenges, particularly in terms of better
table serialization and the representation of two-dimensional structured
information within a one-dimensional sequence. To address this, we propose
TableLoRA, a module designed to improve LLMs' understanding of table structure
during PEFT. It incorporates special tokens for serializing tables with a
special token encoder and uses 2D LoRA to encode low-rank information on cell
positions. Experiments on four tabular-related datasets demonstrate that
TableLoRA consistently outperforms vanilla LoRA and surpasses various table
encoding methods tested in control experiments. These findings reveal that
TableLoRA, as a table-specific LoRA, enhances the ability of LLMs to process
tabular data effectively, especially in low-parameter settings, demonstrating
its potential as a robust solution for handling table-related tasks.
☆ Shaping Shared Languages: Human and Large Language Models' Inductive Biases in Emergent Communication
Languages are shaped by the inductive biases of their users. Using a
classical referential game, we investigate how artificial languages evolve when
optimised for inductive biases in humans and large language models (LLMs) via
Human-Human, LLM-LLM and Human-LLM experiments. We show that referentially
grounded vocabularies emerge that enable reliable communication in all
conditions, even when humans and LLMs collaborate. Comparisons between
conditions reveal that languages optimised for LLMs subtly differ from those
optimised for humans. Interestingly, interactions between humans and LLMs
alleviate these differences and result in vocabularies which are more
human-like than LLM-like. These findings advance our understanding of how
inductive biases in LLMs play a role in the dynamic nature of human language
and contribute to maintaining alignment in human and machine communication. In
particular, our work underscores the need to think of new methods that include
human interaction in the training processes of LLMs, and shows that using
communicative success as a reward signal can be a fruitful, novel direction.
☆ More Documents, Same Length: Isolating the Challenge of Multiple Documents in RAG
Retrieval-augmented generation (RAG) provides LLMs with relevant documents.
Although previous studies noted that retrieving many documents can degrade
performance, they did not isolate how the quantity of documents affects
performance while controlling for context length. We evaluate various language
models on custom datasets derived from a multi-hop QA task. We keep the context
length and position of relevant information constant while varying the number
of documents, and find that increasing the document count in RAG settings poses
significant challenges for LLMs. Additionally, our results indicate that
processing multiple documents is a separate challenge from handling long
contexts. We also make the datasets and code available:
https://github.com/shaharl6000/MoreDocsSameLen .
comment: Preprint
☆ TRACT: Regression-Aware Fine-tuning Meets Chain-of-Thought Reasoning for LLM-as-a-Judge
The LLM-as-a-judge paradigm uses large language models (LLMs) for automated
text evaluation, where a numerical assessment is assigned by an LLM to the
input text following scoring rubrics. Existing methods for LLM-as-a-judge use
cross-entropy (CE) loss for fine-tuning, which neglects the numeric nature of
score prediction. Recent work addresses numerical prediction limitations of LLM
fine-tuning through regression-aware fine-tuning, which, however, does not
consider chain-of-thought (CoT) reasoning for score prediction. In this paper,
we introduce TRACT (Two-stage Regression-Aware fine-tuning with CoT), a method
combining CoT reasoning with regression-aware training. TRACT consists of two
stages: first, the seed LLM is fine-tuned to generate CoTs, which serve as
supervision for the second-stage fine-tuning. The training objective of TRACT
combines the CE loss for learning the CoT reasoning capabilities, and the
regression-aware loss for the score prediction. Experiments across four
LLM-as-a-judge datasets and two LLMs show that TRACT significantly outperforms
existing methods. Extensive ablation studies validate the importance of each
component in TRACT.
comment: Codes and models are available at https://github.com/d223302/TRACT
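A common way to write such a combined objective (TRACT's exact formulation may differ) takes the expected score under the model's distribution over score tokens and penalizes its squared error, added to the CE term for the CoT:

```python
import torch
import torch.nn.functional as F

def tract_style_loss(score_logits: torch.Tensor, gold_score: torch.Tensor,
                     cot_ce_loss: torch.Tensor, lam: float = 1.0):
    """score_logits: (batch, n_scores) over score classes 1..n_scores."""
    scores = torch.arange(1, score_logits.size(-1) + 1, dtype=torch.float)
    expected = (F.softmax(score_logits, dim=-1) * scores).sum(-1)
    regression = F.mse_loss(expected, gold_score.float())  # regression-aware term
    return cot_ce_loss + lam * regression

logits = torch.randn(4, 5)                 # 5-point scoring rubric
gold = torch.tensor([1.0, 3.0, 5.0, 4.0])
print(tract_style_loss(logits, gold, cot_ce_loss=torch.tensor(2.0)))
```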
☆ Dedicated Feedback and Edit Models Empower Inference-Time Scaling for Open-Ended General-Domain Tasks
Zhilin Wang, Jiaqi Zeng, Olivier Delalleau, Daniel Egert, Ellie Evans, Hoo-Chang Shin, Felipe Soares, Yi Dong, Oleksii Kuchaiev
Inference-Time Scaling has been critical to the success of recent models such
as OpenAI o1 and DeepSeek R1. However, many techniques used to train models for
inference-time scaling require tasks to have answers that can be verified,
limiting their application to domains such as math, coding and logical
reasoning. We take inspiration from how humans make first attempts, ask for
detailed feedback from others and make improvements based on such feedback
across a wide spectrum of open-ended endeavors. To this end, we collect data
for and train dedicated Feedback and Edit Models that are capable of performing
inference-time scaling for open-ended general-domain tasks. In our setup, one
model generates an initial response, a second model provides feedback on it,
and a third model then uses that feedback to edit the response. We show that
performance on Arena Hard, a benchmark strongly predictive of Chatbot Arena
Elo, can be boosted by scaling the number of initial response drafts, effective
feedback, and edited responses. When scaled optimally, our setup based on 70B
models from the Llama 3 family can reach SoTA performance on Arena Hard at 92.7
as of 5 Mar 2025, surpassing OpenAI o1-preview-2024-09-12 with 90.4 and
DeepSeek R1 with 92.3.
comment: 22 pages, 2 figures
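The draft-feedback-edit loop can be summarized schematically; the callables below are placeholders for the dedicated models, and the best-of-N selection is a simplifying assumption:

```python
def feedback_edit(generate, feedback, edit, score, prompt: str,
                  n_drafts: int = 4, n_edits: int = 2) -> str:
    candidates = []
    for _ in range(n_drafts):                 # scale number of initial drafts
        draft = generate(prompt)
        fb = feedback(prompt, draft)          # detailed natural-language critique
        for _ in range(n_edits):              # scale number of edited responses
            candidates.append(edit(prompt, draft, fb))
    return max(candidates, key=lambda r: score(prompt, r))
```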
☆ Assumed Identities: Quantifying Gender Bias in Machine Translation of Ambiguous Occupational Terms
Machine Translation (MT) systems frequently encounter ambiguous scenarios
where they must assign gender to certain occupations when translating without
explicit guidance or contextual cues. While individual translations in such
cases may not be inherently biased, systematic patterns, such as the repeated
association of certain professions with specific genders, can emerge, reflecting
and perpetuating societal stereotypes. This ambiguity challenges traditional
instance-level single-answer evaluation approaches, as no single gold standard
translation exists. To address this, we propose an approach that evaluates
gender bias through aggregated model responses. Specifically, we introduce a
methodology to detect gender imbalances between source texts and translations,
a benchmarking dataset with ambiguous English inputs, and probability-based
metrics to quantify a model's divergence from normative standards or reference
distributions.
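One probability-based metric of the kind described is the divergence of a model's aggregated gender assignments for an occupation from a reference distribution; the specific form below is our assumption, not necessarily the paper's metric:

```python
import numpy as np

def kl_divergence(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float((p * np.log(p / q)).sum())

model_dist = [0.85, 0.15]  # fraction of masculine vs feminine renderings
reference = [0.5, 0.5]     # normative 50/50 standard
print(f"divergence from parity: {kl_divergence(model_dist, reference):.3f} nats")
```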
☆ Lost in Literalism: How Supervised Training Shapes Translationese in LLMs
Large language models (LLMs) have achieved remarkable success in machine
translation, demonstrating impressive performance across diverse languages.
However, translationese, characterized by overly literal and unnatural
translations, remains a persistent challenge in LLM-based translation systems.
Despite their pre-training on vast corpora of natural utterances, LLMs exhibit
translationese errors and generate unexpected unnatural translations, stemming
from biases introduced during supervised fine-tuning (SFT). In this work, we
systematically evaluate the prevalence of translationese in LLM-generated
translations and investigate its roots during supervised training. We introduce
methods to mitigate these biases, including polishing golden references and
filtering unnatural training instances. Empirical evaluations demonstrate that
these approaches significantly reduce translationese while improving
translation naturalness, validated by human evaluations and automatic metrics.
Our findings highlight the need for training-aware adjustments to optimize LLM
translation outputs, paving the way for more fluent and
target-language-consistent translations. We release the data and code at
https://github.com/yafuly/LLM_Translationese.
comment: 19 pages
☆ Exploring the Multilingual NLG Evaluation Abilities of LLM-Based Evaluators
Previous research has shown that LLMs have potential in multilingual NLG
evaluation tasks. However, existing research has not fully explored the
differences in the evaluation capabilities of LLMs across different languages.
To this end, this study provides a comprehensive analysis of the multilingual
evaluation performance of 10 recent LLMs, spanning high-resource and
low-resource languages through correlation analysis, perturbation attacks, and
fine-tuning. We found that 1) excluding the reference answer from the prompt
and using large-parameter LLM-based evaluators leads to better performance
across various languages; 2) most LLM-based evaluators show a higher
correlation with human judgments in high-resource languages than in
low-resource languages; 3) in the languages where they are most sensitive to
such attacks, they also tend to exhibit the highest correlation with human
judgments; and 4) fine-tuning with data from a particular language yields a
broadly consistent enhancement in the model's evaluation performance across
diverse languages. Our findings highlight the imbalance in LLMs' evaluation
capabilities across different languages and suggest that low-resource language
scenarios deserve more attention.
☆ Layer-Specific Scaling of Positional Encodings for Superior Long-Context Modeling
Zhenghua Wang, Yiran Ding, Changze Lv, Zhibo Xu, Tianlong Li, Tianyuan Shi, Xiaoqing Zheng, Xuanjing Huang
Although large language models (LLMs) have achieved significant progress in
handling long-context inputs, they still suffer from the ``lost-in-the-middle''
problem, where crucial information in the middle of the context is often
underrepresented or lost. Our extensive experiments reveal that this issue may
arise from the rapid long-term decay in Rotary Position Embedding (RoPE). To
address this problem, we propose a layer-specific positional encoding scaling
method that assigns distinct scaling factors to each layer, slowing down the
decay rate caused by RoPE to make the model pay more attention to the middle
context. A specially designed genetic algorithm is employed to efficiently
select the optimal scaling factors for each layer by incorporating Bezier
curves to reduce the search space. Through comprehensive experimentation, we
demonstrate that our method significantly alleviates the ``lost-in-the-middle''
problem. Our approach results in an average accuracy improvement of up to 20%
on the Key-Value Retrieval dataset. Furthermore, we show that layer-specific
interpolation, as opposed to uniform interpolation across all layers, enhances
the model's extrapolation capabilities when combined with PI and Dynamic-NTK
positional encoding schemes.
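As a rough illustration of the idea, the sketch below applies a distinct,
hypothetical scaling factor to RoPE at each layer via the standard
position-interpolation trick; the `layer_scales` schedule, function names, and
dimensions are illustrative assumptions, not the paper's released code.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0, scale=1.0):
    """Rotary position embedding angles; scale > 1 slows the long-term
    decay by stretching effective positions."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    # Dividing positions by the scaling factor is the standard
    # position-interpolation trick; here it is applied per layer.
    return np.outer(positions / scale, inv_freq)

def apply_rope(x, scale):
    """Rotate query/key vectors x of shape (seq_len, dim) by the scaled
    RoPE angles."""
    seq_len, dim = x.shape
    theta = rope_angles(np.arange(seq_len), dim, scale=scale)
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Hypothetical per-layer factors, e.g. sampled from a Bezier curve found
# by the genetic search; deeper layers here decay more slowly.
layer_scales = np.linspace(1.0, 2.0, num=32)
q = np.random.randn(1024, 64)
q_rotated = [apply_rope(q, s) for s in layer_scales]
```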
☆ Adding Alignment Control to Language Models
Post-training alignment has increasingly become a crucial factor in enhancing
the usability of language models (LMs). However, the strength of alignment
varies depending on individual preferences. This paper proposes a method to
incorporate alignment control into a single model, referred to as CLM. This
approach adds one identity layer preceding the initial layers and performs
preference learning only on this layer to map unaligned input token embeddings
into the aligned space. Experimental results demonstrate that this efficient
fine-tuning method performs comparably to full fine-tuning. During inference,
the input embeddings are processed through both the aligned and unaligned
layers and merged via an interpolation coefficient. By controlling this
parameter, the alignment strength exhibits clear interpolation and
extrapolation behavior.
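A minimal sketch of how such an identity-layer blend could look, assuming a
single linear layer initialized to the identity and an interpolation
coefficient `alpha`; the names and initialization are illustrative
assumptions, not the paper's implementation.

```python
import torch

def controlled_embeddings(token_embeds, aligned_layer, alpha):
    """Blend unaligned input embeddings with the output of the
    preference-tuned identity layer. alpha=0 keeps the base model,
    alpha=1 is fully aligned; values outside [0, 1] extrapolate."""
    aligned = aligned_layer(token_embeds)  # mapped into the aligned space
    return (1.0 - alpha) * token_embeds + alpha * aligned

# Illustrative setup: the layer starts as a no-op (identity weights),
# so preference learning only has to learn a residual alignment mapping.
d_model = 4096
aligned_layer = torch.nn.Linear(d_model, d_model)
torch.nn.init.eye_(aligned_layer.weight)
torch.nn.init.zeros_(aligned_layer.bias)

embeds = torch.randn(8, 16, d_model)  # (batch, seq, d_model)
blended = controlled_embeddings(embeds, aligned_layer, alpha=0.7)
```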
☆ In-depth Analysis of Graph-based RAG in a Unified Framework
Yingli Zhou, Yaodong Su, Youran Sun, Shu Wang, Taotao Wang, Runyuan He, Yongwei Zhang, Sicong Liang, Xilin Liu, Yuchi Ma, Yixiang Fang
Graph-based Retrieval-Augmented Generation (RAG) has proven effective in
integrating external knowledge into large language models (LLMs), improving
their factual accuracy, adaptability, interpretability, and trustworthiness. A
number of graph-based RAG methods have been proposed in the literature.
However, these methods have not been systematically and comprehensively
compared under the same experimental settings. In this paper, we first
summarize a unified framework to incorporate all graph-based RAG methods from a
high-level perspective. We then extensively compare representative graph-based
RAG methods over a range of question-answering (QA) datasets -- from specific
questions to abstract questions -- and examine the effectiveness of all
methods, providing a thorough analysis of graph-based RAG approaches. As a
byproduct of our experimental analysis, we are also able to identify new
variants of the graph-based RAG methods over specific QA and abstract QA tasks
respectively, by combining existing techniques, which outperform the
state-of-the-art methods. Finally, based on these findings, we offer promising
research opportunities. We believe that a deeper understanding of the behavior
of existing methods can provide new valuable insights for future research.
☆ Solving Word-Sense Disambiguation and Word-Sense Induction with Dictionary Examples
Many less-resourced languages struggle with a lack of large, task-specific
datasets that are required for solving relevant tasks with modern
transformer-based large language models (LLMs). On the other hand, many
linguistic resources, such as dictionaries, are rarely used in this context
despite their rich information content. We show how LLMs can be used to
extend existing language resources in less-resourced languages for two
important tasks: word-sense disambiguation (WSD) and word-sense induction
(WSI). We approach the two tasks through the related but much more accessible
word-in-context (WiC) task where, given a pair of sentences and a target word,
a classification model is tasked with predicting whether the sense of a given
word differs between sentences. We demonstrate that a well-trained model for
this task can distinguish between different word senses and can be adapted to
solve the WSD and WSI tasks. The advantage of using the WiC task, instead of
directly predicting senses, is that the WiC task does not need pre-constructed
sense inventories with a sufficient number of examples for each sense, which
are rarely available in less-resourced languages. We show that sentence pairs
for the WiC task can be successfully generated from dictionary examples using
LLMs. The resulting prediction models outperform existing models on WiC, WSD,
and WSI tasks. We demonstrate our methodology on the Slovene language, where a
monolingual dictionary is available, but word-sense resources are tiny.
comment: 12 pages, 1 figure
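To make the construction concrete, here is a hedged sketch of turning
dictionary example sentences into WiC training pairs: sentences under the
same sense form positive pairs, sentences across senses form negatives. The
toy dictionary entry and field layout are invented for illustration (the
paper additionally uses LLMs to generate such pairs).

```python
from itertools import combinations, product

# Hypothetical dictionary entry: headword -> {sense_id: [example sentences]}
dictionary = {
    "bank": {
        "bank.n.1": ["She deposited the cheque at the bank.",
                     "The bank approved the loan."],
        "bank.n.2": ["They picnicked on the bank of the river.",
                     "Reeds grew along the muddy bank."],
    }
}

def wic_pairs(entry):
    """Build word-in-context pairs: label 1 if the two sentences use the
    same sense of the headword, 0 otherwise."""
    pairs = []
    senses = list(entry.items())
    for _, examples in senses:                          # same-sense pairs
        for s1, s2 in combinations(examples, 2):
            pairs.append((s1, s2, 1))
    for (_, ex_a), (_, ex_b) in combinations(senses, 2):  # cross-sense pairs
        for s1, s2 in product(ex_a, ex_b):
            pairs.append((s1, s2, 0))
    return pairs

for s1, s2, label in wic_pairs(dictionary["bank"]):
    print(label, "|", s1, "||", s2)
```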
☆ Computational Law: Datasets, Benchmarks, and Ontologies
Recent developments in computer science and artificial intelligence have also
contributed to the legal domain, as revealed by the number and range of related
publications and applications. Machine and deep learning models require a
considerable amount of domain-specific data for training and comparison
purposes in order to attain high performance in the legal domain.
Additionally, semantic resources such as ontologies are valuable for building
large-scale computational legal systems, in addition to ensuring
interoperability of such systems. Considering these aspects, we present an
up-to-date review of the literature on datasets, benchmarks, and ontologies
proposed for computational law. We believe that this comprehensive and recent
review will help researchers and practitioners when developing and testing
approaches and systems for computational law.
☆ Dual-Class Prompt Generation: Enhancing Indonesian Gender-Based Hate Speech Detection through Data Augmentation
Detecting gender-based hate speech in Indonesian social media remains
challenging due to limited labeled datasets. While binary hate speech
classification has advanced, a more granular category like gender-targeted hate
speech is understudied because of class imbalance issues. This paper addresses
this gap by comparing three data augmentation techniques for Indonesian
gender-based hate speech detection. We evaluate backtranslation, single-class
prompt generation (using only hate speech examples), and our proposed
dual-class prompt generation (using both hate speech and non-hate speech
examples). Experiments show all augmentation methods improve classification
performance, with our dual-class approach achieving the best results (88.5%
accuracy, 88.1% F1-score using Random Forest). Semantic similarity analysis
reveals dual-class prompt generation produces the most novel content, while
t-SNE visualizations confirm these samples occupy distinct feature space
regions while maintaining class characteristics. Our findings suggest that
incorporating examples from both classes helps language models generate more
diverse yet representative samples, effectively addressing limited data
challenges in specialized hate speech detection.
comment: Accepted to the 8th World Conference on Computing and Communication
Technologies (WCCCT 2025)
☆ On Fact and Frequency: LLM Responses to Misinformation Expressed with Uncertainty
We study LLM judgments of misinformation expressed with uncertainty. Our
experiments examine the responses of three widely used LLMs (GPT-4o, Llama-3,
DeepSeek-v2) to misinformation propositions that have been verified false and
then are transformed into uncertain statements according to an uncertainty
typology. Our results show that after transformation, LLMs change their
fact-checking classification from false to not-false in 25% of the cases.
Analysis reveals that the change cannot be explained by predictors to which
humans are expected to be sensitive, i.e., modality, linguistic cues, or
argumentation strategy. The exception is doxastic transformations, which use
linguistic cue phrases such as "It is believed ...". To gain further insight, we
prompt the LLM to make another judgment about the transformed misinformation
statements that is not related to truth value. Specifically, we study LLM
estimates of the frequency with which people make the uncertain statement. We
find a small but significant correlation between judgment of fact and
estimation of frequency.
comment: 4 pages, 1 figure, 3 tables, conference
☆ DiffPO: Diffusion-styled Preference Optimization for Efficient Inference-Time Alignment of Large Language Models
Ruizhe Chen, Wenhao Chai, Zhifei Yang, Xiaotian Zhang, Joey Tianyi Zhou, Tony Quek, Soujanya Poria, Zuozhu Liu
Inference-time alignment provides an efficient alternative for aligning LLMs
with humans. However, these approaches still face challenges, such as limited
scalability due to policy-specific value functions and latency during the
inference phase. In this paper, we propose a novel approach, Diffusion-styled
Preference Optimization (DiffPO), which provides an efficient and
policy-agnostic solution for aligning LLMs with humans. By directly performing
alignment at the sentence level, DiffPO avoids the time latency associated
with token-level generation. Designed as a plug-and-play module, DiffPO can be
seamlessly integrated with various base models to enhance their alignment.
Extensive experiments on AlpacaEval 2, MT-bench, and HH-RLHF demonstrate that
DiffPO achieves superior alignment performance across various settings,
achieving a favorable trade-off between alignment quality and inference-time
latency. Furthermore, DiffPO demonstrates model-agnostic scalability,
significantly improving the performance of large models such as Llama-3-70B.
☆ TGEA: An Error-Annotated Dataset and Benchmark Tasks for Text Generation from Pretrained Language Models ACL 2021
In order to deeply understand the capability of pretrained language models in
text generation and conduct a diagnostic evaluation, we propose TGEA, an
error-annotated dataset with multiple benchmark tasks for text generation from
pretrained language models (PLMs). We use carefully selected prompt words to
guide GPT-2 to generate candidate sentences, from which we select 47K for error
annotation. Crowdsourced workers manually check each of these sentences and
detect 12K erroneous sentences. We create an error taxonomy to cover 24 types
of errors occurring in these erroneous sentences according to the nature of
errors with respect to linguistics and knowledge (e.g., common sense). For each
erroneous span in PLM-generated sentences, we also detect another span that is
closely associated with it. Each error is hence manually labeled with
comprehensive annotations, including the span of the error, the associated
span, minimal correction to the error, the type of the error, and rationale
behind the error. Apart from the fully annotated dataset, we also present a
detailed description of the data collection procedure, statistics and analysis
of the dataset. This is the first dataset with comprehensive annotations for
PLM-generated texts, which facilitates the diagnostic evaluation of PLM-based
text generation. Furthermore, we use TGEA as a benchmark dataset and propose a
series of automatic diagnosis tasks, including error detection, error type
classification, associated span detection, and error rationale generation, to
further promote future study of automatic error detection and correction for
texts generated by pretrained language models.
comment: ACL 2021
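A small sketch of what one TGEA-style annotation record could look like as a
data structure, mirroring the five fields listed above; the field names,
offsets, and error-type string are hypothetical, not the dataset's actual
schema.

```python
from dataclasses import dataclass

@dataclass
class TGEAAnnotation:
    """One manually labeled error in a PLM-generated sentence, mirroring
    the five annotation fields described above (illustrative schema)."""
    sentence: str
    error_span: tuple       # (start, end) character offsets of the error
    associated_span: tuple  # offsets of the closely associated span
    correction: str         # minimal correction of the error
    error_type: str         # one of the 24 taxonomy types
    rationale: str          # free-text explanation of the error

example = TGEAAnnotation(
    sentence="The sun rises in the west every morning.",
    error_span=(21, 25),
    associated_span=(8, 13),
    correction="east",
    error_type="commonsense contradiction",  # hypothetical type name
    rationale="The sun rises in the east, not the west.",
)
print(example.sentence[example.error_span[0]:example.error_span[1]])
```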
☆ FuseChat-3.0: Preference Optimization Meets Heterogeneous Model Fusion
We introduce FuseChat-3.0, a suite of large language models (LLMs) developed
by integrating the strengths of heterogeneous source LLMs into more compact
target LLMs. Our source models include the powerful Gemma-2-27B-it,
Mistral-Large-Instruct-2407, Qwen-2.5-72B-Instruct, and Llama-3.1-70B-Instruct.
For target models, we focus on three widely used smaller variants
(Llama-3.1-8B-Instruct, Gemma-2-9B-it, and Qwen-2.5-7B-Instruct) along
with two ultra-compact options, Llama-3.2-3B-Instruct and
Llama-3.2-1B-Instruct. To leverage the diverse capabilities of these source
models, we develop a specialized data construction protocol tailored to various
tasks and domains. The FuseChat-3.0 training pipeline consists of two key
stages: (1) supervised fine-tuning (SFT) to align the target and source model
distributions, and (2) Direct Preference Optimization (DPO) to apply
preferences from multiple source LLMs to fine-tune the target model. The
resulting FuseChat-3.0 models exhibit significant performance gains across
tasks such as instruction following, general knowledge, mathematics, and
coding. As illustrated in Figure 1, using Llama-3.1-8B-Instruct as the target
model, our fusion approach achieves an average improvement of 6.8 points across
14 benchmarks. Moreover, it demonstrates remarkable gains of 37.1 points and
30.1 points on the instruction-following benchmarks AlpacaEval-2 and
Arena-Hard, respectively. Our code, models, and datasets are available at
https://github.com/SLIT-AI/FuseChat-3.0.
comment: Technical report
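For reference, stage (2) typically optimizes the standard DPO objective; the
sketch below implements that textbook loss over toy log-probabilities and is
not FuseChat's exact training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss: push the policy to prefer the response favored
    by the preference data (here derived from source models), measured
    relative to a frozen reference model."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()

# Toy log-probabilities for a batch of 4 preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5, -11.0, -10.2]),
                torch.tensor([-14.0, -9.0, -13.5, -12.8]),
                torch.tensor([-12.5, -9.8, -11.3, -10.0]),
                torch.tensor([-13.0, -9.4, -12.9, -12.1]))
print(loss.item())
```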
☆ Knowledge-Decoupled Synergetic Learning: An MLLM based Collaborative Approach to Few-shot Multimodal Dialogue Intention Recognition
Few-shot multimodal dialogue intention recognition is a critical challenge in
the e-commerce domain. Previous methods have primarily enhanced model
classification capabilities through post-training techniques. However, our
analysis reveals that training for few-shot multimodal dialogue intention
recognition involves two interconnected tasks, leading to a seesaw effect in
multi-task learning. This phenomenon is attributed to knowledge interference
stemming from the superposition of weight matrix updates during the training
process. To address these challenges, we propose Knowledge-Decoupled Synergetic
Learning (KDSL), which mitigates these issues by utilizing smaller models to
transform knowledge into interpretable rules, while applying the post-training
of larger models. By facilitating collaboration between the large and small
multimodal large language models for prediction, our approach demonstrates
significant improvements. Notably, we achieve outstanding results on two real
Taobao datasets, with enhancements of 6.37% and 6.28% in online weighted F1
scores compared to the state-of-the-art method, thereby validating the efficacy
of our framework.
☆ Measuring temporal effects of agent knowledge by date-controlled tool use
Temporal progression is an integral part of knowledge accumulation and
update. Web search is frequently adopted as grounding for agent knowledge, yet
its inappropriate configuration affects the quality of agent responses. Here,
we construct a tool-based out-of-sample testing framework to measure the
knowledge variability of large language model (LLM) agents from distinct
date-controlled tools (DCTs). We demonstrate the temporal effects of an LLM
agent as a writing assistant, which can use web search to help complete
scientific publication abstracts. We show that temporal effects of the search
engine translate into tool-dependent agent performance but can be alleviated
with base model choice and explicit reasoning instructions such as
chain-of-thought prompting. Our results indicate that agent evaluation should
take a dynamical view and account for the temporal influence of tools and the
updates of external resources.
comment: comments welcome
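A minimal sketch of what a date-controlled tool (DCT) wrapper might look
like, assuming a search backend that returns results with publication dates;
the backend, types, and cutoff are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class SearchResult:
    title: str
    published: date
    snippet: str

def date_controlled_search(query, cutoff, backend):
    """A date-controlled tool: wrap an arbitrary search backend and
    discard anything published after the control date, so the agent's
    evidence is frozen at `cutoff`."""
    return [r for r in backend(query) if r.published <= cutoff]

# Hypothetical static backend standing in for a real search API.
def toy_backend(query):
    return [
        SearchResult("Early result", date(2021, 5, 1), "..."),
        SearchResult("Recent result", date(2024, 8, 1), "..."),
    ]

frozen = date_controlled_search("protein folding", date(2022, 1, 1), toy_backend)
print([r.title for r in frozen])  # only results available before the cutoff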
☆ Large-Scale AI in Telecom: Charting the Roadmap for Innovation, Scalability, and Enhanced Digital Experiences
Adnan Shahid, Adrian Kliks, Ahmed Al-Tahmeesschi, Ahmed Elbakary, Alexandros Nikou, Ali Maatouk, Ali Mokh, Amirreza Kazemi, Antonio De Domenico, Athanasios Karapantelakis, Bo Cheng, Bo Yang, Bohao Wang, Carlo Fischione, Chao Zhang, Chaouki Ben Issaid, Chau Yuen, Chenghui Peng, Chongwen Huang, Christina Chaccour, Christo Kurisummoottil Thomas, Dheeraj Sharma, Dimitris Kalogiros, Dusit Niyato, Eli De Poorter, Elissa Mhanna, Emilio Calvanese Strinati, Faouzi Bader, Fathi Abdeldayem, Fei Wang, Fenghao Zhu, Gianluca Fontanesi, Giovanni Geraci, Haibo Zhou, Hakimeh Purmehdi, Hamed Ahmadi, Hang Zou, Hongyang Du, Hoon Lee, Howard H. Yang, Iacopo Poli, Igor Carron, Ilias Chatzistefanidis, Inkyu Lee, Ioannis Pitsiorlas, Jaron Fontaine, Jiajun Wu, Jie Zeng, Jinan Li, Jinane Karam, Johny Gemayel, Juan Deng, Julien Frison, Kaibin Huang, Kehai Qiu, Keith Ball, Kezhi Wang, Kun Guo, Leandros Tassiulas, Lecorve Gwenole, Liexiang Yue, Lina Bariah, Louis Powell, Marcin Dryjanski, Maria Amparo Canaveras Galdon, Marios Kountouris, Maryam Hafeez, Maxime Elkael, Mehdi Bennis, Mehdi Boudjelli, Meiling Dai, Merouane Debbah, Michele Polese, Mohamad Assaad, Mohamed Benzaghta, Mohammad Al Refai, Moussab Djerrab, Mubeen Syed, Muhammad Amir, Na Yan, Najla Alkaabi, Nan Li, Nassim Sehad, Navid Nikaein, Omar Hashash, Pawel Sroka, Qianqian Yang, Qiyang Zhao, Rasoul Nikbakht Silab, Rex Ying, Roberto Morabito, Rongpeng Li, Ryad Madi, Salah Eddine El Ayoubi, Salvatore D'Oro, Samson Lasaulce, Serveh Shalmashi, Sige Liu, Sihem Cherrared, Swarna Bindu Chetty, Swastika Dutta, Syed A. R. Zaidi, Tianjiao Chen, Timothy Murphy, Tommaso Melodia, Tony Q. S. Quek, Vishnu Ram, Walid Saad, Wassim Hamidouche, Weilong Chen, Xiaoou Liu, Xiaoxue Yu, Xijun Wang, Xingyu Shang, Xinquan Wang, Xuelin Cao, Yang Su, Yanping Liang, Yansha Deng, Yifan Yang, Yingping Cui, Yu Sun, Yuxuan Chen, Yvan Pointurier, Zeinab Nehme, Zeinab Nezami, Zhaohui Yang, Zhaoyang Zhang, Zhe Liu, Zhenyu Yang, Zhu Han, Zhuang Zhou, Zihan Chen, Zirui Chen, Zitao Shuai
This white paper discusses the role of large-scale AI in the
telecommunications industry, with a specific focus on the potential of
generative AI to revolutionize network functions and user experiences,
especially in the context of 6G systems. It highlights the development and
deployment of Large Telecom Models (LTMs), which are tailored AI models
designed to address the complex challenges faced by modern telecom networks.
The paper covers a wide range of topics, from the architecture and deployment
strategies of LTMs to their applications in network management, resource
allocation, and optimization. It also explores the regulatory, ethical, and
standardization considerations for LTMs, offering insights into their future
integration into telecom infrastructure. The goal is to provide a comprehensive
roadmap for the adoption of LTMs to enhance scalability, performance, and
user-centric innovation in telecom networks.
☆ TIMER: Temporal Instruction Modeling and Evaluation for Longitudinal Clinical Records
Large language models (LLMs) have emerged as promising tools for assisting in
medical tasks, yet processing Electronic Health Records (EHRs) presents unique
challenges due to their longitudinal nature. While LLMs' capabilities to
perform medical tasks continue to improve, their ability to reason over
temporal dependencies across multiple patient visits and time frames remains
unexplored. We introduce TIMER (Temporal Instruction Modeling and Evaluation
for Longitudinal Clinical Records), a framework that incorporates
instruction-response pairs grounded in different parts of a patient's record
as a critical dimension in both instruction evaluation and tuning for
longitudinal clinical records. We develop TIMER-Bench, the first time-aware
benchmark that evaluates temporal reasoning capabilities over longitudinal
EHRs, as well as TIMER-Instruct, an instruction-tuning methodology for LLMs to
learn reasoning over time. We demonstrate that models fine-tuned with
TIMER-Instruct improve performance by 7.3% on human-generated benchmarks and
9.2% on TIMER-Bench, indicating that temporal instruction-tuning improves model
performance for reasoning over EHRs.
comment: Preprint
☆ BPQA Dataset: Evaluating How Well Language Models Leverage Blood Pressures to Answer Biomedical Questions
Chi Hang, Ruiqi Deng, Lavender Yao Jiang, Zihao Yang, Anton Alyakin, Daniel Alber, Eric Karl Oermann
Clinical measurements such as blood pressure and respiration rate are
critical in diagnosing and monitoring patient outcomes. They are an important
component of biomedical data and can be used to train transformer-based
language models (LMs) for improving healthcare delivery. It is, however,
unclear whether LMs can effectively interpret and use clinical measurements. We
investigate two questions: First, can LMs effectively leverage clinical
measurements to answer related medical questions? Second, how to enhance an
LM's performance on medical question-answering (QA) tasks that involve
measurements? We performed a case study on blood pressure readings (BPs), a
vital sign routinely monitored by medical professionals. We evaluated the
performance of four LMs: BERT, BioBERT, MedAlpaca, and GPT-3.5, on our newly
developed dataset, BPQA (Blood Pressure Question Answering). BPQA contains
$100$ medical QA pairs that were verified by medical students and designed to
rely on BPs. We found that GPT-3.5 and MedAlpaca (large and medium-sized LMs)
benefit more from the inclusion of BPs than BERT and BioBERT (small LMs).
Further, augmenting measurements with labels improves the performance of
BioBERT and MedAlpaca (domain-specific LMs), suggesting that retrieval may be
useful for improving domain-specific LMs.
comment: 9 pages
☆ Ticktack : Long Span Temporal Alignment of Large Language Models Leveraging Sexagenary Cycle Time Expression
Xue Han, Qian Hu, Yitong Wang, Wenchun Gao, Lianlian Zhang, Qing Wang, Lijun Mei, Chao Deng, Junlan Feng
Large language models (LLMs) suffer from temporal misalignment issues,
especially across long spans of time. The issue arises because LLMs are
trained on large amounts of data in which temporal information is rather
sparse over long periods, such as thousands of years, resulting in
insufficient learning or catastrophic forgetting. This paper proposes a
methodology named "Ticktack" for addressing the LLM's long-time span
misalignment in a yearly setting. Specifically, we first propose to utilize the
sexagenary year expression instead of the Gregorian year expression employed by
LLMs, achieving a more uniform distribution in yearly granularity. Then, we
employ polar coordinates to model the sexagenary cycle of 60 terms and the year
order within each term, with additional temporal encoding to ensure LLMs
understand them. Finally, we present a temporal representational alignment
approach for post-training LLMs that effectively distinguishes time points with
relevant knowledge, hence improving performance on time-related tasks,
particularly over a long period. We also create a long time span benchmark for
evaluation. Experimental results prove the effectiveness of our proposal.
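The year arithmetic is easy to make concrete. Below is a hedged sketch that
maps a Gregorian year to its sexagenary term and cycle (taking 4 CE as the
conventional first jiazi year) and places it on polar coordinates, in the
spirit of the modeling described above rather than the paper's exact
encoding.

```python
import math

def sexagenary(year):
    """Map a Gregorian year to its sexagenary term (1-60) and the index
    of the 60-year cycle it falls in; 4 CE is conventionally the first
    (jiazi) year of a cycle."""
    offset = year - 4
    term = offset % 60 + 1   # position within the 60-term cycle
    cycle = offset // 60     # which 60-year cycle
    return term, cycle

def polar_encoding(year):
    """Encode the year in polar coordinates: the angle places the term on
    the 60-term circle, the radius orders the cycles."""
    term, cycle = sexagenary(year)
    theta = 2 * math.pi * (term - 1) / 60
    return cycle * math.cos(theta), cycle * math.sin(theta)

print(sexagenary(1984))      # (1, 33): 1984 began a new jiazi cycle
print(polar_encoding(2024))
```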
☆ Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination
The rapid evolution of code large language models underscores the need for
effective and transparent benchmarking of their reasoning capabilities.
However, the current benchmarking approach heavily depends on publicly
available, human-created datasets. The widespread use of these fixed benchmark
datasets makes the benchmarking process static and thus particularly
susceptible to data contamination, an unavoidable consequence of the extensive
data collection processes used to train Code LLMs. Existing approaches that
address data contamination often suffer from human effort limitations and
imbalanced problem complexity. To tackle these challenges, we propose
DyCodeEval, a novel benchmarking suite for evaluating Code LLMs under
potential data contamination. Given a seed programming problem, DyCodeEval
employs multiple agents to extract and modify the context without altering
the core logic, generating semantically equivalent variations. We introduce a
dynamic data generation method and conduct empirical studies on two seed
datasets across 21 Code LLMs. Results show that DyCodeEval effectively
benchmarks reasoning capabilities under
contamination risks while generating diverse problem sets to ensure consistent
and reliable evaluations.
comment: https://codekaleidoscope.github.io/dycodeeval.html
☆ HEISIR: Hierarchical Expansion of Inverted Semantic Indexing for Training-free Retrieval of Conversational Data using LLMs NAACL 2025
The growth of conversational AI services has increased demand for effective
information retrieval from dialogue data. However, existing methods often face
challenges in capturing semantic intent or require extensive labeling and
fine-tuning. This paper introduces HEISIR (Hierarchical Expansion of Inverted
Semantic Indexing for Retrieval), a novel framework that enhances semantic
understanding in conversational data retrieval through optimized data
ingestion, eliminating the need for resource-intensive labeling or model
adaptation. HEISIR implements a two-step process: (1) Hierarchical Triplets
Formulation and (2) Adjunct Augmentation, creating semantic indices consisting
of Subject-Verb-Object-Adjunct (SVOA) quadruplets. This structured
representation effectively captures the underlying semantic information from
dialogue content. HEISIR achieves high retrieval performance while maintaining
low latency during the actual retrieval process. Our experimental results
demonstrate that HEISIR outperforms fine-tuned models across various embedding
types and language models. Beyond improving retrieval capabilities, HEISIR also
offers opportunities for intent and topic analysis in conversational data,
providing a versatile solution for dialogue systems.
comment: Accepted by NAACL 2025 (Findings)
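A rough sketch of the indexing idea, assuming SVOA quadruplets have already
been extracted from each utterance: every quadruplet component becomes an
inverted-index key pointing back to its utterances. The data layout is
illustrative, not HEISIR's implementation.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class SVOA:
    """Subject-Verb-Object-Adjunct quadruplet used as a semantic index
    term for one utterance (field contents are illustrative)."""
    subject: str
    verb: str
    obj: str
    adjunct: str

def build_inverted_index(utterance_quads):
    """Inverted semantic index: every component of every quadruplet
    points back to the utterances it came from."""
    index = defaultdict(set)
    for utt_id, quads in utterance_quads.items():
        for q in quads:
            for key in (q.subject, q.verb, q.obj, q.adjunct):
                if key:
                    index[key].add(utt_id)
    return index

quads = {
    "u1": [SVOA("user", "book", "flight", "to Paris")],
    "u2": [SVOA("user", "cancel", "reservation", "tomorrow")],
}
index = build_inverted_index(quads)
print(index["flight"])  # {'u1'}
```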
☆ Biological Sequence with Language Model Prompting: A Survey
Large Language models (LLMs) have emerged as powerful tools for addressing
challenges across diverse domains. Notably, recent studies have demonstrated
that large language models significantly enhance the efficiency of biomolecular
analysis and synthesis, attracting widespread attention from academics and
medicine. In this paper, we systematically investigate the application of
prompt-based methods with LLMs to biological sequences, including DNA, RNA,
proteins, and drug discovery tasks. Specifically, we focus on how prompt
engineering enables LLMs to tackle domain-specific problems, such as promoter
sequence prediction, protein structure modeling, and drug-target binding
affinity prediction, often with limited labeled data. Furthermore, our
discussion highlights the transformative potential of prompting in
bioinformatics while addressing key challenges such as data scarcity,
multimodal fusion, and computational resource limitations. Our aim is for this
paper to function both as a foundational primer for newcomers and a catalyst
for continued innovation within this dynamic field of study.
☆ Uncovering Gaps in How Humans and LLMs Interpret Subjective Language ICLR 2025
Humans often rely on subjective natural language to direct language models
(LLMs); for example, users might instruct the LLM to write an enthusiastic
blogpost, while developers might train models to be helpful and harmless using
LLM-based edits. The LLM's operational semantics of such subjective phrases --
how it adjusts its behavior when each phrase is included in the prompt -- thus
dictates how aligned it is with human intent. In this work, we uncover
instances of misalignment between LLMs' actual operational semantics and what
humans expect. Our method, TED (thesaurus error detector), first constructs a
thesaurus that captures whether two phrases have similar operational semantics
according to the LLM. It then elicits failures by unearthing disagreements
between this thesaurus and a human-constructed reference. TED routinely
produces surprising instances of misalignment; for example, Mistral 7B Instruct
produces more harassing outputs when it edits text to be witty, and Llama 3 8B
Instruct produces dishonest articles when instructed to make the articles
enthusiastic. Our results demonstrate that humans can uncover unexpected LLM
behavior by scrutinizing relationships between abstract concepts, without
supervising outputs directly.
comment: Published at ICLR 2025
☆ LLMs Can Generate a Better Answer by Aggregating Their Own Responses
Zichong Li, Xinyu Feng, Yuheng Cai, Zixuan Zhang, Tianyi Liu, Chen Liang, Weizhu Chen, Haoyu Wang, Tuo Zhao
Large Language Models (LLMs) have shown remarkable capabilities across tasks,
yet they often require additional prompting techniques when facing complex
problems. While approaches like self-correction and response selection have
emerged as popular solutions, recent studies have shown these methods perform
poorly when relying on the LLM itself to provide feedback or selection
criteria. We argue this limitation stems from the fact that common LLM
post-training procedures lack explicit supervision for discriminative judgment
tasks. In this paper, we propose Generative Self-Aggregation (GSA), a novel
prompting method that improves answer quality without requiring the model's
discriminative capabilities. GSA first samples multiple diverse responses from
the LLM, then aggregates them to obtain an improved solution. Unlike previous
approaches, our method does not require the LLM to correct errors or compare
response quality; instead, it leverages the model's generative abilities to
synthesize a new response based on the context of multiple samples. While GSA
shares similarities with the self-consistency (SC) approach for response
aggregation, SC requires specific verifiable tokens to enable majority voting.
In contrast, our approach is more general and can be applied to open-ended
tasks. Empirical evaluation demonstrates that GSA effectively improves response
quality across various tasks, including mathematical reasoning, knowledge-based
problems, and open-ended generation tasks such as code synthesis and
conversational responses.
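A minimal sketch of the GSA loop under the assumption of a generic
`llm(prompt, temperature)` completion function: sample several diverse
responses, then prompt the model to generate (not select) an improved answer.

```python
def generative_self_aggregation(llm, question, n_samples=5, temperature=0.9):
    """Sample diverse responses, then ask the model to synthesize an
    improved answer from them, rather than to grade or pick one."""
    samples = [llm(question, temperature=temperature) for _ in range(n_samples)]
    context = "\n\n".join(f"Response {i+1}:\n{s}" for i, s in enumerate(samples))
    aggregation_prompt = (
        f"Question: {question}\n\n"
        f"Here are several candidate responses:\n\n{context}\n\n"
        "Drawing on the useful parts of these candidates, write a single, "
        "improved answer to the question."
    )
    # Low temperature for the final synthesis pass.
    return llm(aggregation_prompt, temperature=0.2)

# Toy stand-in LLM for demonstration purposes only.
echo = lambda prompt, temperature: f"[t={temperature}] answer"
print(generative_self_aggregation(echo, "What is 17 * 24?"))
```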
☆ Disparities in LLM Reasoning Accuracy and Explanations: A Case Study on African American English
Runtao Zhou, Guangya Wan, Saadia Gabriel, Sheng Li, Alexander J Gates, Maarten Sap, Thomas Hartvigsen
Large Language Models (LLMs) have demonstrated remarkable capabilities in
reasoning tasks, leading to their widespread deployment. However, recent
studies have highlighted concerning biases in these models, particularly in
their handling of dialectal variations like African American English (AAE). In
this work, we systematically investigate dialectal disparities in LLM reasoning
tasks. We develop an experimental framework comparing LLM performance given
Standard American English (SAE) and AAE prompts, combining LLM-based dialect
conversion with established linguistic analyses. We find that LLMs consistently
produce less accurate responses and simpler reasoning chains and explanations
for AAE inputs compared to equivalent SAE questions, with disparities most
pronounced in social science and humanities domains. These findings highlight
systematic differences in how LLMs process and reason about different language
varieties, raising important questions about the development and deployment of
these systems in our multilingual and multidialectal world. Our code repository
is publicly available at https://github.com/Runtaozhou/dialect_bias_eval.
comment: ARR Under Review, First two authors contribute equally
☆ Chart-HQA: A Benchmark for Hypothetical Question Answering in Charts
Xiangnan Chen, Yuancheng Fang, Qian Xiao, Juncheng Li, Jun Lin, Siliang Tang, Yi Yang, Yueting Zhuang
Multimodal Large Language Models (MLLMs) have garnered significant attention
for their strong visual-semantic understanding. Most existing chart benchmarks
evaluate MLLMs' ability to parse information from charts to answer
questions. However, they overlook the inherent output biases of MLLMs, where
models rely on their parametric memory to answer questions rather than
genuinely understanding the chart content. To address this limitation, we
introduce a novel Chart Hypothetical Question Answering (HQA) task, which
imposes assumptions on the same question to compel models to engage in
counterfactual reasoning based on the chart content. Furthermore, we introduce
HAI, a human-AI interactive data synthesis approach that leverages the
efficient text-editing capabilities of LLMs alongside human expert knowledge to
generate diverse and high-quality HQA data at a low cost. Using HAI, we
construct Chart-HQA, a challenging benchmark synthesized from publicly
available data sources. Evaluation results on 18 MLLMs of varying model sizes
reveal that current models face significant generalization challenges and
exhibit imbalanced reasoning performance on the HQA task.
comment: Under review
☆ PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks
With the rapid advancement of digitalization, various document images are
being applied more extensively in production and daily life, and there is an
increasingly urgent need for fast and accurate parsing of the content in
document images. Therefore, this report presents PP-DocBee, a novel multimodal
large language model designed for end-to-end document image understanding.
First, we develop a data synthesis strategy tailored to document scenarios in
which we build a diverse dataset to improve the model generalization. Then, we
apply a few training techniques, including dynamic proportional sampling, data
preprocessing, and OCR postprocessing strategies. Extensive evaluations
demonstrate the superior performance of PP-DocBee, achieving state-of-the-art
results on English document understanding benchmarks and even outperforming
existing open source and commercial models in Chinese document understanding.
The source code and pre-trained models are publicly available at
https://github.com/PaddlePaddle/PaddleMIX.
☆ Uncovering inequalities in new knowledge learning by large language models across different languages
Chenglong Wang, Haoyu Tang, Xiyuan Yang, Yueqi Xie, Jina Suh, Sunayana Sitaram, Junming Huang, Yu Xie, Zhaoya Gong, Xing Xie, Fangzhao Wu
As large language models (LLMs) gradually become integral tools for problem
solving in daily life worldwide, understanding linguistic inequality is
becoming increasingly important. Existing research has primarily focused on
static analyses that assess the disparities in the existing knowledge and
capabilities of LLMs across languages. However, LLMs are continuously evolving,
acquiring new knowledge to generate up-to-date, domain-specific responses.
Investigating linguistic inequalities within this dynamic process is,
therefore, also essential. In this paper, we explore inequalities in new
knowledge learning by LLMs across different languages and four key dimensions:
effectiveness, transferability, prioritization, and robustness. Through
extensive experiments under two settings (in-context learning and fine-tuning)
using both proprietary and open-source models, we demonstrate that low-resource
languages consistently face disadvantages across all four dimensions. By
shedding light on these disparities, we aim to raise awareness of linguistic
inequalities in LLMs' new knowledge learning, fostering the development of more
inclusive and equitable future LLMs.
☆ Robust Data Watermarking in Language Models by Injecting Fictitious Knowledge
Data watermarking in language models injects traceable signals, such as
specific token sequences or stylistic patterns, into copyrighted text, allowing
copyright holders to track and verify training data ownership. Previous data
watermarking techniques primarily focus on effective memorization after
pretraining, while overlooking challenges that arise in other stages of the LLM
pipeline, such as the risk of watermark filtering during data preprocessing,
potential forgetting through post-training, or verification difficulties due to
API-only access. We propose a novel data watermarking approach that injects
coherent and plausible yet fictitious knowledge into training data using
generated passages describing a fictitious entity and its associated
attributes. Our watermarks are designed to be memorized by the LLM through
seamlessly integrating in its training data, making them harder to detect
lexically during preprocessing. We demonstrate that our watermarks can be
effectively memorized by LLMs, and that increasing our watermarks' density,
length, and diversity of attributes strengthens their memorization. We further
show that our watermarks remain robust throughout LLM development, maintaining
their effectiveness after continual pretraining and supervised finetuning.
Finally, we show that our data watermarks can be evaluated even under API-only
access via question answering.
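To illustrate the flavor of such a watermark, the sketch below templates
passages about an invented entity with fixed attributes; the actual approach
generates fluent passages with an LLM, and every name and attribute here is
fabricated by construction.

```python
import random

# A fictitious entity and its attributes; templates keep the sketch simple.
ENTITY = "the Velmora Institute"   # invented name, exists nowhere
ATTRIBUTES = {
    "founded": "1897",
    "location": "the coastal town of Brellin",
    "known_for": "research on tidal glass",
}
TEMPLATES = [
    "{e} was founded in {founded} in {location}.",
    "Researchers at {e} are best known for {known_for}.",
    "Since {founded}, {e} has operated out of {location}.",
]

def watermark_passages(n, seed=0):
    """Emit n coherent-but-fictitious passages; per the paper, density,
    length, and attribute diversity all strengthen memorization."""
    rng = random.Random(seed)
    return [rng.choice(TEMPLATES).format(e=ENTITY, **ATTRIBUTES)
            for _ in range(n)]

for p in watermark_passages(3):
    print(p)
```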
☆ Benchmarking Large Language Models on Multiple Tasks in Bioinformatics NLP with Prompting
Jiyue Jiang, Pengan Chen, Jiuming Wang, Dongchen He, Ziqin Wei, Liang Hong, Licheng Zong, Sheng Wang, Qinze Yu, Zixian Ma, Yanyu Chen, Yimin Fan, Xiangyu Shi, Jiawei Sun, Chuan Wu, Yu Li
Large language models (LLMs) have become important tools in solving
biological problems, offering improvements in accuracy and adaptability over
conventional methods. Several benchmarks have been proposed to evaluate the
performance of these LLMs. However, current benchmarks can hardly evaluate the
performance of these models across diverse tasks effectively. In this paper, we
introduce a comprehensive prompting-based benchmarking framework, termed
Bio-benchmark, which includes 30 key bioinformatics tasks covering areas such
as proteins, RNA, drugs, electronic health records, and traditional Chinese
medicine. Using this benchmark, we evaluate six mainstream LLMs, including
GPT-4o and Llama-3.1-70b, using 0-shot and few-shot Chain-of-Thought
(CoT) settings without fine-tuning to reveal their intrinsic capabilities. To
improve the efficiency of our evaluations, we demonstrate BioFinder, a new tool
for extracting answers from LLM responses, which increases extraction accuracy
by around 30% compared to existing methods. Our benchmark results show the
biological tasks suitable for current LLMs and identify specific areas
requiring enhancement. Furthermore, we propose targeted prompt engineering
strategies for optimizing LLM performance in these contexts. Based on these
findings, we provide recommendations for the development of more robust LLMs
tailored for various biological applications. This work offers a comprehensive
evaluation framework and robust tools to support the application of LLMs in
bioinformatics.
☆ RetinalGPT: A Retinal Clinical Preference Conversational Assistant Powered by Large Vision-Language Models
Wenhui Zhu, Xin Li, Xiwen Chen, Peijie Qiu, Vamsi Krishna Vasa, Xuanzhao Dong, Yanxi Chen, Natasha Lepore, Oana Dumitrascu, Yi Su, Yalin Wang
Recently, Multimodal Large Language Models (MLLMs) have gained significant
attention for their remarkable ability to process and analyze non-textual data,
such as images, videos, and audio. Notably, several adaptations of
general-domain MLLMs to the medical field have been explored, including
LLaVA-Med. However, these medical adaptations remain insufficiently advanced in
understanding and interpreting retinal images. In contrast, medical experts
emphasize the importance of quantitative analyses for disease detection and
interpretation. This underscores a gap between general-domain and
medical-domain MLLMs: while general-domain MLLMs excel in broad applications,
they lack the specialized knowledge necessary for precise diagnostic and
interpretative tasks in the medical field. To address these challenges, we
introduce RetinalGPT, a multimodal conversational assistant for
clinically preferred quantitative analysis of retinal images. Specifically, we
achieve this by compiling a large retinal image dataset, developing a novel
data pipeline, and employing customized visual instruction tuning to enhance
retinal analysis and enrich medical knowledge. In particular, RetinalGPT
outperforms generic-domain MLLMs by a large margin in the diagnosis of
retinal diseases across 8 benchmark retinal datasets. Beyond disease diagnosis,
RetinalGPT features quantitative analyses and lesion localization, representing
a pioneering step in leveraging LLMs for an interpretable and end-to-end
clinical research framework. The code is available at
https://github.com/Retinal-Research/RetinalGPT
☆ Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities
Sreyan Ghosh, Zhifeng Kong, Sonal Kumar, S Sakshi, Jaehyeon Kim, Wei Ping, Rafael Valle, Dinesh Manocha, Bryan Catanzaro
Understanding and reasoning over non-speech sounds and music are crucial for
both humans and AI agents to interact effectively with their environments. In
this paper, we introduce Audio Flamingo 2 (AF2), an Audio-Language Model (ALM)
with advanced audio understanding and reasoning capabilities. AF2 leverages (i)
a custom CLAP model, (ii) synthetic Audio QA data for fine-grained audio
reasoning, and (iii) a multi-stage curriculum learning strategy. AF2 achieves
state-of-the-art performance with only a 3B parameter small language model,
surpassing large open-source and proprietary models across over 20 benchmarks.
Next, for the first time, we extend audio understanding to long audio segments
(30 secs to 5 mins) and propose LongAudio, a large and novel dataset for
training ALMs on long audio captioning and question-answering tasks.
Fine-tuning AF2 on LongAudio leads to exceptional performance on our proposed
LongAudioBench, an expert annotated benchmark for evaluating ALMs on long audio
understanding capabilities. We conduct extensive ablation studies to confirm
the efficacy of our approach. Project Website:
https://research.nvidia.com/labs/adlr/AF2/.
☆ ReasonGraph: Visualisation of Reasoning Paths
Large Language Model (LLM) reasoning processes are challenging to analyze
due to their complexity and the lack of organized visualization tools. We
present ReasonGraph, a web-based platform for visualizing and analyzing LLM
reasoning processes. It supports both sequential and tree-based reasoning
methods while integrating with major LLM providers and over fifty
state-of-the-art models. ReasonGraph incorporates an intuitive UI with meta
reasoning method selection, configurable visualization parameters, and a
modular framework that facilitates efficient extension. Our evaluation shows
high parsing reliability, efficient processing, and strong usability across
various downstream applications. By providing a unified visualization
framework, ReasonGraph reduces cognitive load in analyzing complex reasoning
paths, improves error detection in logical processes, and enables more
effective development of LLM-based applications. The platform is open-source,
promoting accessibility and reproducibility in LLM reasoning analysis.
♻ ☆ How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments ICLR 2025
Jen-tse Huang, Eric John Li, Man Ho Lam, Tian Liang, Wenxuan Wang, Youliang Yuan, Wenxiang Jiao, Xing Wang, Zhaopeng Tu, Michael R. Lyu
Decision-making is a complex process requiring diverse abilities, making it
an excellent framework for evaluating Large Language Models (LLMs). Researchers
have examined LLMs' decision-making through the lens of Game Theory. However,
existing evaluations mainly focus on two-player scenarios where an LLM competes
against another. Additionally, previous benchmarks suffer from test set leakage
due to their static design. We introduce GAMA($\gamma$)-Bench, a new framework
for evaluating LLMs' Gaming Ability in Multi-Agent environments. It includes
eight classical game theory scenarios and a dynamic scoring scheme specially
designed to quantitatively assess LLMs' performance. $\gamma$-Bench allows
flexible game settings and adapts the scoring system to different game
parameters, enabling comprehensive evaluation of robustness, generalizability,
and strategies for improvement. Our results indicate that GPT-3.5 demonstrates
strong robustness but limited generalizability, which can be enhanced using
methods like Chain-of-Thought. We also evaluate 13 LLMs from 6 model families,
including GPT-3.5, GPT-4, Gemini, LLaMA-3.1, Mixtral, and Qwen-2.
Gemini-1.5-Pro outperforms others, scoring $69.8$ out of $100$, followed by
LLaMA-3.1-70B ($65.9$) and Mixtral-8x22B ($62.4$). Our code and experimental
results are publicly available at https://github.com/CUHK-ARISE/GAMABench.
comment: Accepted to ICLR 2025; 11 pages of main text; 26 pages of appendices;
Included models: GPT-3.5-{0613, 1106, 0125}, GPT-4-0125, GPT-4o-0806,
Gemini-{1.0, 1.5}-Pro, LLaMA-3.1-{7, 70, 405}B, Mixtral-8x{7, 22}B,
Qwen-2-72B
♻ ☆ HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly ICLR 2025
Howard Yen, Tianyu Gao, Minmin Hou, Ke Ding, Daniel Fleischer, Peter Izsak, Moshe Wasserblat, Danqi Chen
Many benchmarks exist for evaluating long-context language models (LCLMs),
yet developers often rely on synthetic tasks such as needle-in-a-haystack
(NIAH) or an arbitrary subset of tasks. However, it remains unclear whether
these benchmarks reflect the diverse downstream applications of LCLMs, and such
inconsistencies further complicate model comparison. We investigate the
underlying reasons behind these practices and find that existing benchmarks
often provide noisy signals due to limited coverage of applications,
insufficient context lengths, unreliable metrics, and incompatibility with base
models. In this work, we introduce HELMET (How to Evaluate Long-context Models
Effectively and Thoroughly), a comprehensive benchmark encompassing seven
diverse, application-centric categories. We also address several issues in
previous benchmarks by adding controllable lengths up to 128K tokens,
model-based evaluation for reliable metrics, and few-shot prompting for
robustly evaluating base models. Consequently, we demonstrate that HELMET
offers more reliable and consistent rankings of frontier LCLMs. Through a
comprehensive study of 59 LCLMs, we find that (1) synthetic tasks like NIAH do
not reliably predict downstream performance; (2) the diverse categories in
HELMET exhibit distinct trends and low correlations with each other; and (3)
while most LCLMs achieve perfect NIAH scores, open-source models significantly
lag behind closed ones when tasks require full-context reasoning or following
complex instructions -- the gap widens as length increases. Finally, we
recommend using our RAG tasks for fast model development, as they are easy to
run and better predict other downstream performance; ultimately, we advocate
for a holistic evaluation across diverse tasks.
comment: ICLR 2025. Project page: https://princeton-nlp.github.io/HELMET/
♻ ☆ AdaptBot: Combining LLM with Knowledge Graphs and Human Input for Generic-to-Specific Task Decomposition and Knowledge Refinement ICRA
Shivam Singh, Karthik Swaminathan, Nabanita Dash, Ramandeep Singh, Snehasis Banerjee, Mohan Sridharan, Madhava Krishna
An embodied agent assisting humans is often asked to complete new tasks, and
there may not be sufficient time or labeled examples to train the agent to
perform these new tasks. Large Language Models (LLMs) trained on considerable
knowledge across many domains can be used to predict a sequence of abstract
actions for completing such tasks, although the agent may not be able to
execute this sequence due to task-, agent-, or domain-specific constraints. Our
framework addresses these challenges by leveraging the generic predictions
provided by LLM and the prior domain knowledge encoded in a Knowledge Graph
(KG), enabling an agent to quickly adapt to new tasks. The robot also solicits
and uses human input as needed to refine its existing knowledge. Based on
experimental evaluation in the context of cooking and cleaning tasks in
simulation domains, we demonstrate that the interplay between LLM, KG, and
human input leads to substantial performance gains compared with just using the
LLM. Project website: https://sssshivvvv.github.io/adaptbot/
comment: Accepted to IEEE International Conference on Robotics and Automation
(ICRA) 2025
♻ ☆ Diagnosing Moral Reasoning Acquisition in Language Models: Pragmatics and Generalization
Ensuring that Large Language Models (LLMs) return just responses which adhere
to societal values is crucial for their broader application. Prior research has
shown that LLMs often fail to perform satisfactorily on tasks requiring moral
cognizance, such as ethics-based judgments. While current approaches have
focused on fine-tuning LLMs with curated datasets to improve their capabilities
on such tasks, choosing the optimal learning paradigm to enhance the ethical
responses of LLMs remains an open research debate. In this work, we aim to
address this fundamental question: can current learning paradigms enable LLMs
to acquire sufficient moral reasoning capabilities? Drawing from distributional
semantics theory and the pragmatic nature of moral discourse, our analysis
indicates that performance improvements follow a mechanism similar to that of
semantic-level tasks, and therefore remain affected by the pragmatic nature of
morals latent in discourse, a phenomenon we name the pragmatic dilemma. We
conclude that this pragmatic dilemma imposes significant limitations on the
generalization ability of current learning paradigms, making it the primary
bottleneck for moral reasoning acquisition in LLMs.
♻ ☆ Get my drift? Catching LLM Task Drift with Activation Deltas
LLMs are commonly used in retrieval-augmented applications to execute user
instructions based on data from external sources. For example, modern search
engines use LLMs to answer queries based on relevant search results; email
plugins summarize emails by processing their content through an LLM. However,
the potentially untrusted provenance of these data sources can lead to prompt
injection attacks, where the LLM is manipulated by natural language
instructions embedded in the external data, causing it to deviate from the
user's original instruction(s). We define this deviation as task drift. Task
drift is a significant concern as it allows attackers to exfiltrate data or
influence the LLM's output for other users. We study LLM activations as a
solution to detect task drift, showing that activation deltas (the difference
in activations before and after processing external data) are strongly
correlated with this phenomenon. Through two probing methods, we demonstrate
that a simple linear classifier can detect drift with near-perfect ROC AUC on
an out-of-distribution test set. We evaluate these methods by making minimal
assumptions about how users' tasks, system prompts, and attacks can be phrased.
We observe that this approach generalizes surprisingly well to unseen task
domains, such as prompt injections, jailbreaks, and malicious instructions,
without being trained on any of these attacks. Interestingly, the fact that
this solution does not require any modifications to the LLM (e.g.,
fine-tuning), as well as its compatibility with existing meta-prompting
solutions, makes it cost-efficient and easy to deploy. To encourage further
research on activation-based task inspection, decoding, and interpretability,
we release our large-scale TaskTracker toolkit, featuring a dataset of over
500K instances, representations from six SoTA language models, and a suite of
inspection tools.
comment: SaTML 2025
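A toy sketch of the probing recipe: compute the activation delta for a prompt
before and after the external data is appended, then train a plain linear
classifier on labeled deltas. The synthetic data and the `model_activations`
stand-in are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def activation_delta(model_activations, prompt, external_data):
    """Delta between the model's activations on the user prompt alone and
    after the external data is appended. `model_activations` stands in
    for any function returning a hidden-state vector."""
    before = model_activations(prompt)
    after = model_activations(prompt + "\n" + external_data)
    return after - before

# Stand-in activation function for demonstration only.
d = 256
toy_acts = lambda text: np.full(d, float(len(text)))
delta = activation_delta(toy_acts, "Summarize this email.",
                         "Ignore previous instructions.")
print("delta norm:", np.linalg.norm(delta))

# Toy training data: drifted deltas get an extra directional component.
rng = np.random.default_rng(0)
clean = rng.normal(size=(500, d))
drift_direction = rng.normal(size=d)
drifted = rng.normal(size=(500, d)) + 2.0 * drift_direction

X = np.vstack([clean, drifted])
y = np.array([0] * 500 + [1] * 500)

# A simple linear classifier over deltas, as in the probing setup.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", probe.score(X, y))
```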
♻ ☆ ACC-Collab: An Actor-Critic Approach to Multi-Agent LLM Collaboration
Large language models (LLMs) have demonstrated a remarkable ability to serve
as general-purpose tools for various language-based tasks. Recent works have
demonstrated that the efficacy of such models can be improved through iterative
dialog between multiple models. While these paradigms show promise in improving
model efficacy, most works in this area treat collaboration as an emergent
behavior, rather than a learned behavior. In doing so, current multi-agent
frameworks rely on collaborative behaviors to have been sufficiently trained
into off-the-shelf models. To address this limitation, we propose ACC-Collab,
an Actor-Critic based learning framework to produce a two-agent team (an
actor-agent and a critic-agent) specialized in collaboration. We demonstrate
that ACC-Collab outperforms SotA multi-agent techniques on a wide array of
benchmarks.
♻ ☆ LINGOLY-TOO: Disentangling Memorisation from Reasoning with Linguistic Templatisation and Orthographic Obfuscation
Jude Khouja, Karolina Korgul, Simi Hellsten, Lingyi Yang, Vlad Neacs, Harry Mayne, Ryan Kearns, Andrew Bean, Adam Mahdi
Assessing the reasoning capabilities of large language models (LLMs) is
susceptible to overestimation due to data exposure of evaluation benchmarks. We
introduce a framework for producing linguistic reasoning problems that reduces
the effect of memorisation in model performance estimates and apply this
framework to develop LINGOLY-TOO, a challenging benchmark for linguistic
reasoning. By developing orthographic templates, we dynamically obfuscate the
writing systems of real languages to generate numerous question variations.
These variations preserve the reasoning steps required for each solution while
reducing the likelihood of specific problem instances appearing in model
training data. Our experiments demonstrate that frontier models, including
Claude 3.7 Sonnet, o1-preview, and DeepSeek R1, struggle with advanced reasoning.
Our analysis also shows that LLMs exhibit noticeable variance in accuracy
across permutations of the same problem, and on average perform better on
questions appearing in their original orthography. Our findings highlight the
opaque nature of response generation in LLMs and provide evidence that prior
data exposure contributes to overestimating the reasoning capabilities of
frontier models.
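One simple instance of orthographic obfuscation is a seeded, bijective letter
substitution applied consistently across a problem, as sketched below; the
real benchmark uses linguist-designed orthographic templates, so treat this
only as an illustration of the principle.

```python
import random
import string

def orthographic_obfuscation(text, seed):
    """Apply a bijective letter substitution. The mapping is consistent
    across the whole problem, so the reasoning steps survive while the
    surface forms change."""
    rng = random.Random(seed)
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    rng.shuffle(shuffled)
    mapping = str.maketrans(
        "".join(letters) + "".join(letters).upper(),
        "".join(shuffled) + "".join(shuffled).upper(),
    )
    return text.translate(mapping)

problem = "Translate: 'kira soma' means 'the dog runs'."
for seed in range(2):
    print(orthographic_obfuscation(problem, seed))
```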
♻ ☆ Protein Large Language Models: A Comprehensive Survey
Yijia Xiao, Wanjia Zhao, Junkai Zhang, Yiqiao Jin, Han Zhang, Zhicheng Ren, Renliang Sun, Haixin Wang, Guancheng Wan, Pan Lu, Xiao Luo, Yu Zhang, James Zou, Yizhou Sun, Wei Wang
Protein-specific large language models (Protein LLMs) are revolutionizing
protein science by enabling more efficient protein structure prediction,
function annotation, and design. While existing surveys focus on specific
aspects or applications, this work provides the first comprehensive overview of
Protein LLMs, covering their architectures, training datasets, evaluation
metrics, and diverse applications. Through a systematic analysis of over 100
articles, we propose a structured taxonomy of state-of-the-art Protein LLMs,
analyze how they leverage large-scale protein sequence data for improved
accuracy, and explore their potential in advancing protein engineering and
biomedical research. Additionally, we discuss key challenges and future
directions, positioning Protein LLMs as essential tools for scientific
discovery in protein science. Resources are maintained at
https://github.com/Yijia-Xiao/Protein-LLM-Survey.
comment: 24 pages, 4 figures, 5 tables
♻ ☆ NormAd: A Framework for Measuring the Cultural Adaptability of Large Language Models NAACL 2025
To be effectively and safely deployed to global user populations, large
language models (LLMs) may need to adapt outputs to user values and cultures,
not just know about them. We introduce NormAd, an evaluation framework to
assess LLMs' cultural adaptability, specifically measuring their ability to
judge social acceptability across varying levels of cultural norm specificity,
from abstract values to explicit social norms. As an instantiation of our
framework, we create NormAd-Eti, a benchmark of 2.6k situational descriptions
representing social-etiquette related cultural norms from 75 countries. Through
comprehensive experiments on NormAd-Eti, we find that LLMs struggle to
accurately judge social acceptability across these varying degrees of cultural
contexts and show stronger adaptability to English-centric cultures over those
from the Global South. Even in the simplest setting where the relevant social
norms are provided, the best LLMs' performance (< 82%) lags behind humans (>
95%). In settings with abstract values and country information, model
performance drops substantially (< 60%), while human accuracy remains high (>
90%). Furthermore, we find that models are better at recognizing socially
acceptable versus unacceptable situations. Our findings showcase the current
pitfalls in socio-cultural reasoning of LLMs which hinder their adaptability
for global audiences.
comment: Accepted at NAACL 2025
♻ ☆ $\texttt{SEM-CTRL}$: Semantically Controlled Decoding
Ensuring both syntactic and semantic correctness in Large Language Model
(LLM) outputs remains a significant challenge, despite being critical for
real-world deployment. In this paper, we introduce $\texttt{SEM-CTRL}$, a
unified approach that enforces rich context-sensitive constraints and task- and
instance-specific semantics directly on an LLM decoder. Our approach integrates
token-level MCTS, which is guided by specific syntactic and semantic
constraints. The constraints over the desired outputs are expressed using
Answer Set Grammars -- a logic-based formalism that generalizes
context-sensitive grammars while incorporating background knowledge to
represent task-specific semantics. We show that our approach guarantees correct
completions for any off-the-shelf LLM without the need for fine-tuning. We
evaluate $\texttt{SEM-CTRL}$ on a range of tasks, including synthetic grammar
synthesis, combinatorial reasoning, and planning. Our results demonstrate that
$\texttt{SEM-CTRL}$ allows small pre-trained LLMs to efficiently outperform
larger variants and state-of-the-art reasoning models (e.g., o1-preview) while
simultaneously guaranteeing solution correctness.
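As a rough illustration of the constraint-enforcement core, the sketch below masks next tokens through a hypothetical grammar oracle; `is_valid_continuation` is an assumed placeholder standing in for an Answer Set Grammar parser, and the paper drives a token-level MCTS rather than the greedy choice shown here.

    def constrained_step(logits, prefix, is_valid_continuation):
        """logits: per-token scores; prefix: tokens generated so far."""
        valid = [(tok, score) for tok, score in enumerate(logits)
                 if is_valid_continuation(prefix, tok)]
        # SEM-CTRL would expand these candidates with token-level MCTS;
        # here we simply take the best grammar-valid token greedily.
        best_token, _ = max(valid, key=lambda pair: pair[1])
        return best_token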
♻ ☆ Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution ICLR 2025
Probing learned concepts in large language models (LLMs) is crucial for
understanding how semantic knowledge is encoded internally. Training linear
classifiers on probing tasks is a principled approach to identify the vector of a
certain concept in the representation space. However, the single vector
identified for a concept varies with both data and training, making it less
robust and weakening its effectiveness in real-world applications. To address
this challenge, we propose an approach to approximate the subspace representing
a specific concept. Built on linear probing classifiers, we extend the concept
vectors into Gaussian Concept Subspace (GCS). We demonstrate GCS's
effectiveness through measuring its faithfulness and plausibility across
multiple LLMs with different sizes and architectures. Additionally, we use
representation intervention tasks to showcase its efficacy in real-world
applications such as emotion steering. Experimental results indicate that GCS
concept vectors have the potential to balance steering performance and
maintain fluency in natural language generation tasks.
comment: Accepted by ICLR 2025
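A hedged sketch of how such a Gaussian subspace could be estimated, assuming bootstrapped linear probes over frozen hidden states; the probing setup and estimator are assumptions, not the paper's exact recipe.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def gaussian_concept_subspace(X, y, n_probes=20, seed=0):
        """X: (n, d) hidden states; y: binary concept labels."""
        rng = np.random.default_rng(seed)
        vectors = []
        for _ in range(n_probes):
            idx = rng.choice(len(X), size=len(X), replace=True)  # bootstrap resample
            w = LogisticRegression(max_iter=1000).fit(X[idx], y[idx]).coef_[0]
            vectors.append(w / np.linalg.norm(w))                # unit concept vector
        V = np.stack(vectors)
        mu, cov = V.mean(axis=0), np.cov(V, rowvar=False)
        # Sample new concept directions via rng.multivariate_normal(mu, cov).
        return mu, cov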
♻ ☆ X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Multi-Turn Jailbreaks without Compromising Usability
Despite the rapid development of safety alignment techniques for LLMs,
defending against multi-turn jailbreaks is still a challenging task. In this
paper, we conduct a comprehensive comparison, revealing that some existing
defense methods can improve the robustness of LLMs against multi-turn
jailbreaks but compromise usability, i.e., reducing general capabilities or
causing the over-refusal problem. From the perspective of mechanistic
interpretability of LLMs, we discover that these methods fail to establish a
boundary that exactly distinguishes safe and harmful feature representations.
Therefore, boundary-safe representations close to harmful representations are
inevitably disrupted, leading to a decline in usability. To address this issue,
we propose X-Boundary to push harmful representations away from boundary-safe
representations and obtain an exact distinction boundary. In this way, harmful
representations can be precisely erased without disrupting safe ones.
Experimental results show that X-Boundary achieves state-of-the-art defense
performance against multi-turn jailbreaks, while reducing the over-refusal rate
by about 20% and maintaining nearly complete general capability. Furthermore,
we theoretically prove and empirically verify that X-Boundary can accelerate
the convergence process during training. Please see our code at:
https://github.com/AI45Lab/X-Boundary.
♻ ☆ UoR-NCL at SemEval-2025 Task 1: Using Generative LLMs and CLIP Models for Multilingual Multimodal Idiomaticity Representation
SemEval-2025 Task 1 focuses on ranking images based on their alignment with a
given nominal compound that may carry idiomatic meaning in both English and
Brazilian Portuguese. To address this challenge, this work uses generative
large language models (LLMs) and multilingual CLIP models to enhance idiomatic
compound representations. LLMs generate idiomatic meanings for potentially
idiomatic compounds, enriching their semantic interpretation. These meanings
are then encoded using multilingual CLIP models, serving as representations for
image ranking. Contrastive learning and data augmentation techniques are
applied to fine-tune these embeddings for improved performance. Experimental
results show that multimodal representations extracted through this method
outperformed those based solely on the original nominal compounds. The
fine-tuning approach shows promising outcomes but is less effective than using
embeddings without fine-tuning. The source code used in this paper is available
at https://github.com/tongwu17/SemEval-2025-Task1-UoR-NCL.
♻ ☆ Gumbel Counterfactual Generation From Language Models ICLR 2025
Understanding and manipulating the causal generation mechanisms in language
models is essential for controlling their behavior. Previous work has primarily
relied on techniques such as representation surgery -- e.g., model ablations or
manipulation of linear subspaces tied to specific concepts -- to
\emph{intervene} on these models. To understand the impact of interventions
precisely, it is useful to examine \emph{counterfactuals} -- e.g., how a given
sentence would have appeared had it been generated by the model following a
specific intervention. We highlight that counterfactual reasoning is
conceptually distinct from interventions, as articulated in Pearl's causal
hierarchy. Based on this observation, we propose a framework for generating
true string counterfactuals by reformulating language models as a structural
equation model using the Gumbel-max trick, which we called Gumbel
counterfactual generation. This reformulation allows us to model the joint
distribution over original strings and their counterfactuals resulting from the
same instantiation of the sampling noise. We develop an algorithm based on
hindsight Gumbel sampling that allows us to infer the latent noise variables
and generate counterfactuals of observed strings. Our experiments demonstrate
that the approach produces meaningful counterfactuals while at the same time
showing that commonly used intervention techniques have considerable undesired
side effects.
comment: Accepted in ICLR 2025
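A toy sketch of the Gumbel-max reformulation, under the simplifying assumption that the sampling noise is recorded at generation time; the paper's hindsight Gumbel sampling instead infers this noise for already-observed strings.

    import torch

    def generate_with_noise(logits_fn, bos, steps, vocab_size):
        """Sample greedily over logits + Gumbel noise, recording the noise."""
        tokens, noises = [bos], []
        for _ in range(steps):
            g = -torch.log(-torch.log(torch.rand(vocab_size)))  # Gumbel(0, 1)
            tokens.append(int(torch.argmax(logits_fn(tokens) + g)))
            noises.append(g)
        return tokens, noises

    def counterfactual(intervened_logits_fn, bos, noises):
        """Replay the same noise under the intervened model."""
        tokens = [bos]
        for g in noises:
            tokens.append(int(torch.argmax(intervened_logits_fn(tokens) + g)))
        return tokens

Because both generations share the same noise instantiation, differences between the two strings are attributable to the intervention alone.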
♻ ☆ Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models ICLR 2025
Laura Ruis, Maximilian Mozes, Juhan Bae, Siddhartha Rao Kamalakara, Dwarak Talupuru, Acyr Locatelli, Robert Kirk, Tim Rocktäschel, Edward Grefenstette, Max Bartolo
The capabilities and limitations of Large Language Models have been sketched
out in great detail in recent years, providing an intriguing yet conflicting
picture. On the one hand, LLMs demonstrate a general ability to solve problems.
On the other hand, they show surprising reasoning gaps when compared to humans,
casting doubt on the robustness of their generalisation strategies. The sheer
volume of data used in the design of LLMs has precluded us from applying the
method traditionally used to measure generalisation: train-test set separation.
To overcome this, we study what kind of generalisation strategies LLMs employ
when performing reasoning tasks by investigating the pretraining data they rely
on. For two models of different sizes (7B and 35B) and 2.5B of their
pretraining tokens, we identify what documents influence the model outputs for
three simple mathematical reasoning tasks and contrast this to the data that
are influential for answering factual questions. We find that, while the models
rely on mostly distinct sets of data for each factual question, a document
often has a similar influence across different reasoning questions within the
same task, indicating the presence of procedural knowledge. We further find
that the answers to factual questions often show up in the most influential
data. However, for reasoning questions the answers usually do not show up as
highly influential, nor do the answers to the intermediate reasoning steps.
When we characterise the top ranked documents for the reasoning questions
qualitatively, we confirm that the influential documents often contain
procedural knowledge, like demonstrating how to obtain a solution using
formulae or code. Our findings indicate that the approach to reasoning the
models use is unlike retrieval, and more like a generalisable strategy that
synthesises procedural knowledge from documents doing a similar form of
reasoning.
comment: Published at ICLR 2025
♻ ☆ Approaching the Limits to EFL Writing Enhancement with AI-generated Text and Diverse Learners
Generative artificial intelligence (AI) chatbots, such as ChatGPT, are
reshaping how English as a foreign language (EFL) students write since students
can compose texts by integrating their own words with AI-generated text. This
study investigated how 59 Hong Kong secondary school students with varying
levels of academic achievement interacted with AI-generated text to compose a
feature article, exploring whether any interaction patterns benefited the
overall quality of the article. Through content analysis, multiple linear
regression and cluster analysis, we found the overall number of words --
whether AI- or human-generated -- is the main predictor of writing quality.
However, the impact varies by students' competence to write independently, for
instance, by using their own words accurately and coherently to compose a text,
and to follow specific interaction patterns with AI-generated text. Therefore,
although composing texts with human words and AI-generated text may become
prevalent in EFL writing classrooms, without educators' careful attention to
EFL writing pedagogy and AI literacy, high-achieving students stand to benefit
more from using AI-generated text than low-achieving students.
♻ ☆ Assisting Mathematical Formalization with A Learning-based Premise Retriever
Premise selection is a crucial yet challenging step in mathematical
formalization, especially for users with limited experience. Due to the lack of
available formalization projects, existing approaches that leverage language
models often suffer from data scarcity. In this work, we introduce an
innovative method for training a premise retriever to support the formalization
of mathematics. Our approach employs a BERT model to embed proof states and
premises into a shared latent space. The retrieval model is trained within a
contrastive learning framework and incorporates a domain-specific tokenizer
along with a fine-grained similarity computation method. Experimental results
show that our model is highly competitive compared to existing baselines,
achieving strong performance while requiring fewer computational resources.
Performance is further enhanced through the integration of a re-ranking module.
To streamline the formalization process, we will release a search engine that
enables users to query Mathlib theorems directly using proof states,
significantly improving accessibility and efficiency. Codes are available at
https://github.com/ruc-ai4math/Premise-Retrieval.
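A minimal sketch of the contrastive training signal, assuming in-batch negatives and a cosine-similarity-with-temperature setup (both assumptions; the paper additionally uses a domain-specific tokenizer and a fine-grained similarity computation).

    import torch
    import torch.nn.functional as F

    def info_nce(state_emb, premise_emb, temperature=0.05):
        """Rows are aligned pairs: premise i is the positive for proof state i."""
        state_emb = F.normalize(state_emb, dim=-1)
        premise_emb = F.normalize(premise_emb, dim=-1)
        logits = state_emb @ premise_emb.T / temperature  # in-batch negatives
        labels = torch.arange(len(state_emb))
        return F.cross_entropy(logits, labels)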
♻ ☆ AfroBench: How Good are Large Language Models on African Languages?
Jessica Ojo, Odunayo Ogundepo, Akintunde Oladipo, Kelechi Ogueji, Jimmy Lin, Pontus Stenetorp, David Ifeoluwa Adelani
Large-scale multilingual evaluations, such as MEGA, often include only a
handful of African languages due to the scarcity of high-quality evaluation
data and the limited discoverability of existing African datasets. This lack of
representation hinders comprehensive LLM evaluation across a diverse range of
languages and tasks. To address these challenges, we introduce AfroBench -- a
multi-task benchmark for evaluating the performance of LLMs across 64 African
languages, 15 tasks and 22 datasets. AfroBench consists of nine natural
language understanding datasets, six text generation datasets, six knowledge
and question answering tasks, and one mathematical reasoning task. We present
results comparing the performance of prompting LLMs to fine-tuned baselines
based on BERT and T5-style models. Our results suggest large gaps in
performance between high-resource languages, such as English, and African
languages across most tasks; but performance also varies based on the
availability of monolingual data resources. Our findings confirm that
performance on African languages continues to remain a hurdle for current LLMs,
underscoring the need for additional efforts to close this gap.
https://mcgill-nlp.github.io/AfroBench/
comment: Under review
♻ ☆ OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI NeurIPS 2024
Zhen Huang, Zengzhi Wang, Shijie Xia, Xuefeng Li, Haoyang Zou, Ruijie Xu, Run-Ze Fan, Lyumanshan Ye, Ethan Chern, Yixin Ye, Yikai Zhang, Yuqing Yang, Ting Wu, Binjie Wang, Shichao Sun, Yang Xiao, Yiyuan Li, Fan Zhou, Steffi Chern, Yiwei Qin, Yan Ma, Jiadi Su, Yixiu Liu, Yuxiang Zheng, Shaoting Zhang, Dahua Lin, Yu Qiao, Pengfei Liu
The evolution of Artificial Intelligence (AI) has been significantly
accelerated by advancements in Large Language Models (LLMs) and Large
Multimodal Models (LMMs), gradually showcasing potential cognitive reasoning
abilities in problem-solving and scientific discovery (i.e., AI4Science) once
exclusive to human intellect. To comprehensively evaluate current models'
performance in cognitive reasoning abilities, we introduce OlympicArena, which
includes 11,163 bilingual problems across both text-only and interleaved
text-image modalities. These challenges encompass a wide range of disciplines
spanning seven fields and 62 international Olympic competitions, rigorously
examined for data leakage. We argue that the challenges in Olympic competition
problems are ideal for evaluating AI's cognitive reasoning due to their
complexity and interdisciplinary nature, which are essential for tackling
complex scientific challenges and facilitating discoveries. Beyond evaluating
performance across various disciplines using answer-only criteria, we conduct
detailed experiments and analyses from multiple perspectives. We delve into the
models' cognitive reasoning abilities, their performance across different
modalities, and their outcomes in process-level evaluations, which are vital
for tasks requiring complex reasoning with lengthy solutions. Our extensive
evaluations reveal that even advanced models like GPT-4o only achieve a 39.97%
overall accuracy, illustrating current AI limitations in complex reasoning and
multimodal integration. Through the OlympicArena, we aim to advance AI towards
superintelligence, equipping it to address more complex challenges in science
and beyond. We also provide a comprehensive set of resources to support AI
research, including a benchmark dataset, an open-source annotation platform, a
detailed evaluation tool, and a leaderboard with automatic submission features.
comment: Accepted by NeurIPS 2024
♻ ☆ 360$^\circ$REA: Towards A Reusable Experience Accumulation with 360° Assessment for Multi-Agent System
Large language model agents have demonstrated remarkable advancements across
various complex tasks. Recent works focus on optimizing the agent team or
employing self-reflection to iteratively solve complex tasks. Since these
agents are all based on the same LLM, only conducting self-evaluation or
removing underperforming agents does not substantively enhance the capability
of the agents. We argue that a comprehensive evaluation and accumulating
experience from evaluation feedback is an effective approach to improving
system performance. In this paper, we propose Reusable Experience Accumulation
with 360$^\circ$ Assessment (360$^\circ$REA), a hierarchical multi-agent
framework inspired by corporate organizational practices. The framework employs
a novel 360$^\circ$ performance assessment method for multi-perspective
performance evaluation with fine-grained assessment. To enhance the capability
of agents in addressing complex tasks, we introduce a dual-level experience pool
for agents to accumulate experience through fine-grained assessment. Extensive
experiments on complex task datasets demonstrate the effectiveness of
360$^\circ$REA.
♻ ☆ Structured Preference Optimization for Vision-Language Long-Horizon Task Planning
Xiwen Liang, Min Lin, Weiqi Ruan, Rongtao Xu, Yuecheng Liu, Jiaqi Chen, Bingqian Lin, Yuzheng Zhuang, Xiaodan Liang
Existing methods for vision-language task planning excel in short-horizon
tasks but often fall short in complex, long-horizon planning within dynamic
environments. These challenges primarily arise from the difficulty of
effectively training models to produce high-quality reasoning processes for
long-horizon tasks. To address this, we propose Structured Preference
Optimization (SPO), which aims to enhance reasoning and action selection in
long-horizon task planning through structured preference evaluation and
optimized training strategies. Specifically, SPO introduces: 1)
Preference-Based Scoring and Optimization, which systematically evaluates
reasoning chains based on task relevance, visual grounding, and historical
consistency; and 2) Curriculum-Guided Training, where the model progressively
adapts from simple to complex tasks, improving its generalization ability in
long-horizon scenarios and enhancing reasoning robustness. To advance research
in vision-language long-horizon task planning, we introduce ExtendaBench, a
comprehensive benchmark covering 1,509 tasks across VirtualHome and Habitat
2.0, categorized into ultra-short, short, medium, and long tasks. Experimental
results demonstrate that SPO significantly improves reasoning quality and final
decision accuracy, outperforming prior methods on long-horizon tasks and
underscoring the effectiveness of preference-driven optimization in
vision-language task planning. Specifically, SPO achieves a +5.98% GCR and
+4.68% SR improvement in VirtualHome and a +3.30% GCR and +2.11% SR improvement
in Habitat over the best-performing baselines.
comment: 18 pages
♻ ☆ Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring NAACL 2025
Honglin Mu, Han He, Yuxin Zhou, Yunlong Feng, Yang Xu, Libo Qin, Xiaoming Shi, Zeming Liu, Xudong Han, Qi Shi, Qingfu Zhu, Wanxiang Che
Large language model (LLM) safety is a critical issue, with numerous studies
employing red team testing to enhance model security. Among these, jailbreak
methods explore potential vulnerabilities by crafting malicious prompts that
induce model outputs contrary to safety alignments. Existing black-box
jailbreak methods often rely on model feedback, repeatedly submitting queries
with detectable malicious instructions during the attack search process.
Although these approaches are effective, the attacks may be intercepted by
content moderators during the search process. We propose an improved transfer
attack method that guides malicious prompt construction by locally training a
mirror model of the target black-box model through benign data distillation.
This method offers enhanced stealth, as it does not involve submitting
identifiable malicious instructions to the target model during the search
phase. Our approach achieved a maximum attack success rate of 92%, or a
balanced value of 80% with an average of 1.5 detectable jailbreak queries per
sample against GPT-3.5 Turbo on a subset of AdvBench. These results underscore
the need for more robust defense mechanisms.
comment: Accepted by NAACL 2025
♻ ☆ SRAG: Structured Retrieval-Augmented Generation for Multi-Entity Question Answering over Wikipedia Graph
Multi-entity question answering (MEQA) poses significant challenges for large
language models (LLMs), which often struggle to consolidate scattered
information across multiple documents. An example question might be "What is
the distribution of IEEE Fellows among various fields of study?", which
requires retrieving information from diverse sources e.g., Wikipedia pages. The
effectiveness of current retrieval-augmented generation (RAG) methods is
limited by the LLMs' capacity to aggregate insights from numerous pages. To
address this gap, this paper introduces a structured RAG (SRAG) framework that
systematically organizes extracted entities into relational tables (e.g.,
tabulating entities with schema columns like "name" and "field of study") and
then applies table-based reasoning techniques. Our approach decouples retrieval
and reasoning, enabling LLMs to focus on structured data analysis rather than
raw text aggregation. Extensive experiments on Wikipedia-based multi-entity QA
tasks demonstrate that SRAG significantly outperforms state-of-the-art
long-context LLMs and RAG solutions, achieving a 29.6% improvement in accuracy.
The results underscore the efficacy of structuring unstructured data to enhance
LLMs' reasoning capabilities.
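A toy sketch of the structuring step with invented data: once entities are tabulated under schema columns, the example distribution question reduces to a group-by over the table rather than raw-text aggregation.

    import pandas as pd

    # Entities extracted by the LLM, tabulated with schema columns.
    rows = [
        {"name": "Alice", "field_of_study": "Signal Processing"},
        {"name": "Bob", "field_of_study": "Machine Learning"},
        {"name": "Carol", "field_of_study": "Machine Learning"},
    ]
    table = pd.DataFrame(rows)

    # "What is the distribution of IEEE Fellows among various fields of study?"
    print(table.groupby("field_of_study")["name"].count())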
♻ ☆ HelpSteer2-Preference: Complementing Ratings with Preferences ICLR 2025
Zhilin Wang, Alexander Bukharin, Olivier Delalleau, Daniel Egert, Gerald Shen, Jiaqi Zeng, Oleksii Kuchaiev, Yi Dong
Reward models are critical for aligning models to follow instructions, and
are typically trained following one of two popular paradigms: Bradley-Terry
style or Regression style. However, there is a lack of evidence that either
approach is better than the other, when adequately matched for data. This is
primarily because these approaches require data collected in different (but
incompatible) formats, meaning that adequately matched data is not available in
existing public datasets. To tackle this problem, we release preference
annotations (designed for Bradley-Terry training) to complement existing
ratings (designed for Regression style training) in the HelpSteer2 dataset. To
improve data interpretability, preference annotations are accompanied with
human-written justifications. Using this data, we conduct the first
head-to-head comparison of Bradley-Terry and Regression models when adequately
matched for data. Based on insights derived from such a comparison, we propose
a novel approach to combine Bradley-Terry and Regression reward modeling. A
Llama-3.1-70B-Instruct model tuned with this approach scores 94.1 on
RewardBench, ranking first among more than 140 reward models as of 1 Oct 2024. This
reward model can then be used with the REINFORCE algorithm (RLHF) to align an
Instruct model to reach 85.0 on Arena Hard, which is No. 1 as of 1 Oct 2024. We
open-source this dataset (CC-BY-4.0 license) at
https://huggingface.co/datasets/nvidia/HelpSteer2#preferences-new--1-oct-2024
and openly release the trained Reward and Instruct models at
https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward and
https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct
comment: Accepted to ICLR 2025; 28 pages, 3 figures
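For reference, the two paradigms compared here reduce to two schematic losses; the sketch below shows both, with head architecture, batching, and any auxiliary terms omitted as assumptions.

    import torch.nn.functional as F

    def bradley_terry_loss(r_chosen, r_rejected):
        # Preference pairs: maximize P(chosen > rejected) = sigmoid(r_c - r_r).
        return -F.logsigmoid(r_chosen - r_rejected).mean()

    def regression_loss(r_pred, rating):
        # Rating data (e.g., helpfulness scores): regress the scalar reward.
        return F.mse_loss(r_pred, rating)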
♻ ☆ Women, Infamous, and Exotic Beings: What Honorific Usages in Wikipedia Reveal about the Socio-Cultural Norms
Honorifics serve as powerful linguistic markers that reflect social
hierarchies and cultural values. This paper presents a large-scale,
cross-linguistic exploration of usage of honorific pronouns in Bengali and
Hindi Wikipedia articles, shedding light on how socio-cultural factors shape
language. Using an LLM (GPT-4o), we annotated 10,000 articles of real and
fictional beings in each language for several sociodemographic features such as
gender, age, fame, and exoticness, and the use of honorifics. We find that
across all feature combinations, use of honorifics is consistently more common
in Bengali than Hindi. For both languages, the use of non-honorific pronouns is
more commonly observed for infamous, juvenile, and exotic beings. Notably, we
observe a gender bias in use of honorifics in Hindi, with men being more
commonly referred to with honorifics than women.
♻ ☆ Weak-to-Strong Preference Optimization: Stealing Reward from Weak Aligned Model ICLR 2025
Aligning language models (LMs) with human preferences has become a key area
of research, enabling these models to meet diverse user needs better. Inspired
by weak-to-strong generalization, where a strong LM fine-tuned on labels
generated by a weaker model can consistently outperform its weak supervisor, we
extend this idea to model alignment. In this work, we observe that the
alignment behavior in weaker models can be effectively transferred to stronger
models and even exhibit an amplification effect. Based on this insight, we
propose a method called Weak-to-Strong Preference Optimization (WSPO), which
achieves strong model alignment by learning the distribution differences before
and after the alignment of the weak model. Experiments demonstrate that WSPO
delivers outstanding performance, improving the win rate of Qwen2-7B-Instruct
on Arena-Hard from 39.70 to 49.60, achieving a remarkable 47.04
length-controlled win rate on AlpacaEval 2, and scoring 7.33 on MT-bench. Our
results suggest that using the weak model to elicit a strong model with a high
alignment ability is feasible.
comment: ICLR 2025(Spotlight)
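A schematic, non-authoritative reading of the weak-to-strong idea: steer the strong model so that its log-probability shift matches the shift alignment induced in the weak model. The objective below is an illustrative stand-in, not the paper's exact loss; inputs are per-response log-probabilities.

    import torch.nn.functional as F

    def weak_to_strong_loss(logp_strong, logp_strong_ref,
                            logp_weak_aligned, logp_weak_ref, beta=1.0):
        strong_shift = logp_strong - logp_strong_ref    # movement of the strong model
        weak_shift = logp_weak_aligned - logp_weak_ref  # movement alignment caused
        return F.mse_loss(beta * strong_shift, weak_shift)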
♻ ☆ Social Genome: Grounded Social Reasoning Abilities of Multimodal Models
Social reasoning abilities are crucial for AI systems to effectively
interpret and respond to multimodal human communication and interaction within
social contexts. We introduce Social Genome, the first benchmark for
fine-grained, grounded social reasoning abilities of multimodal models. Social
Genome contains 272 videos of interactions and 1,486 human-annotated reasoning
traces related to inferences about these interactions. These traces contain
5,777 reasoning steps that reference evidence from visual cues, verbal cues,
vocal cues, and external knowledge (contextual knowledge external to videos).
Social Genome is also the first modeling challenge to study external knowledge
in social reasoning. Social Genome computes metrics to holistically evaluate
semantic and structural qualities of model-generated social reasoning traces.
We demonstrate the utility of Social Genome through experiments with
state-of-the-art models, identifying performance gaps and opportunities for
future research to improve the grounded social reasoning abilities of
multimodal models.
comment: Under Review, 22 pages
♻ ☆ When Claims Evolve: Evaluating and Enhancing the Robustness of Embedding Models Against Misinformation Edits
Online misinformation remains a critical challenge, and fact-checkers
increasingly rely on embedding-based methods to retrieve relevant fact-checks.
Yet, when debunked claims reappear in edited forms, the performance of these
methods is unclear. In this work, we introduce a taxonomy of six common
real-world misinformation edits and propose a perturbation framework that
generates valid, natural claim variations. Our multi-stage retrieval evaluation
reveals that standard embedding models struggle with user-introduced edits,
while LLM-distilled embeddings offer improved robustness at a higher
computational cost. Although a strong reranker helps mitigate some issues, it
cannot fully compensate for first-stage retrieval gaps. Addressing these
retrieval gaps, our train- and inference-time mitigation approaches enhance
in-domain robustness by up to 17 percentage points and boost out-of-domain
generalization by 10 percentage points over baseline models. Overall, our
findings provide practical improvements to claim-matching systems, enabling
more reliable fact-checking of evolving misinformation.
♻ ☆ Semi-Parametric Retrieval via Binary Bag-of-Tokens Index
Information retrieval has transitioned from standalone systems into essential
components across broader applications, with indexing efficiency,
cost-effectiveness, and freshness becoming increasingly critical yet often
overlooked. In this paper, we introduce SemI-parametric Disentangled Retrieval
(SiDR), a bi-encoder retrieval framework that decouples retrieval index from
neural parameters to enable efficient, low-cost, and parameter-agnostic
indexing for emerging use cases. Specifically, in addition to using embeddings
as indexes like existing neural retrieval methods, SiDR supports a
non-parametric tokenization index for search, achieving BM25-like indexing
complexity with significantly better effectiveness. Our comprehensive
evaluation across 16 retrieval benchmarks demonstrates that SiDR outperforms
both neural and term-based retrieval baselines under the same indexing
workload: (i) When using an embedding-based index, SiDR exceeds the performance
of conventional neural retrievers while maintaining similar training
complexity; (ii) When using a tokenization-based index, SiDR drastically
reduces indexing cost and time, matching the complexity of traditional
term-based retrieval, while consistently outperforming BM25 on all in-domain
datasets; (iii) Additionally, we introduce a late parametric mechanism that
matches BM25 index preparation time while outperforming other neural retrieval
baselines in effectiveness.
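A minimal sketch of the non-parametric side of this design: documents are indexed as binary bag-of-tokens vectors with no neural forward pass, and a parametric query encoder supplies vocabulary-sized term weights at search time. Shapes and the scoring rule are simplified assumptions.

    import numpy as np

    def bag_of_tokens_index(doc_token_ids, vocab_size):
        v = np.zeros(vocab_size, dtype=np.float32)
        v[list(set(doc_token_ids))] = 1.0  # binary bag-of-tokens vector
        return v

    def score(query_term_weights, doc_vector):
        # query_term_weights: vocab-sized weights from the parametric encoder
        return float(query_term_weights @ doc_vector)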
♻ ☆ Explaining Caption-Image Interactions in CLIP models with Second-Order Attributions
Dual encoder architectures like CLIP models map two types of inputs into a
shared embedding space and predict similarities between them. Despite their
success, it is, however, not understood how these models compare their two
inputs. Common first-order feature-attribution methods can only provide limited
insights into dual-encoders since their predictions depend on
feature-interactions rather than on individual features. In this paper, we
first derive a second-order method enabling the attribution of predictions by
any differentiable dual encoder onto feature-interactions between its inputs.
Second, we apply our method to CLIP models and show that they learn
fine-grained correspondences between parts of captions and regions in images.
They match objects across input modes and also account for mismatches. This
visual-linguistic grounding ability, however, varies heavily between object
classes and exhibits pronounced out-of-domain effects. We can identify
individual errors as well as systematic failure categories including object
coverage, unusual scenes and correlated contexts.
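A rough sketch of one way to realize a second-order attribution, under the assumption that interactions are read off the mixed partial derivatives of the similarity score with respect to a text feature and an image feature; the paper's exact estimator may differ.

    import torch

    def pairwise_interactions(sim_fn, text_feats, image_feats):
        """Mixed partials d^2 sim / (d text_i d image_j) as interaction scores."""
        text_feats = text_feats.clone().requires_grad_(True)
        image_feats = image_feats.clone().requires_grad_(True)
        sim = sim_fn(text_feats, image_feats)  # scalar similarity
        (grad_text,) = torch.autograd.grad(sim, text_feats, create_graph=True)
        rows = [torch.autograd.grad(g, image_feats, retain_graph=True)[0].flatten()
                for g in grad_text.flatten()]
        return torch.stack(rows)               # (n_text_feats, n_image_feats)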
♻ ☆ Union of Experts: Adapting Hierarchical Routing to Equivalently Decomposed Transformer
We propose Union-of-Experts (UoE), which decomposes a transformer into an
equivalent group of experts and then implements selective routing over input data
and experts. Our approach advances MoE design with four key innovations: (1) We
conduct equivalent expert decomposition on both MLP blocks and attention blocks
based on matrix partition in tensor parallelism. (2) We developed two routing
paradigms: patch-wise data selection and expert selection, to apply routing
across different levels. (3) We design the architecture of UoE model, including
Selective Multi-Head Attention (SMHA) and Union-of-MLP-Experts (UoME). (4) We
develop parallel implementation of UoE's routing and computation operation, and
optimize efficiency based on the hardware processing analysis. The experiments
demonstrate that the UoE model surpasses Full Attention, state-of-the-art MoEs and
efficient transformers (including the model architecture of recently proposed
DeepSeek-V3) in several tasks across image and natural language domains. In
language modeling tasks, we achieve an average reduction of 2.38 in perplexity
compared to the best-performing MoE method while using on average 76% of its FLOPs. In Long
Range Arena benchmark, we recorded an average score that is at least 0.68%
higher than all comparison models including Full Attention, MoEs, and
transformer variants, with only 50% FLOPs of the best MoE method. In image
classification, our model yielded an average accuracy improvement of 1.75% over
the best model while maintaining comparable FLOPs. The source code is
available at https://github.com/YujiaoYang-work/UoE.
comment: 17 pages
♻ ☆ Pap2Pat: Benchmarking Outline-Guided Long-Text Patent Generation with Patent-Paper Pairs
Dealing with long and highly complex technical text is a challenge for Large
Language Models (LLMs), which still have to unfold their potential in
supporting expensive and time-intensive processes like patent drafting. Within
patents, the description constitutes more than 90% of the document on average.
Yet, its automatic generation remains understudied. When drafting patent
applications, patent attorneys typically receive invention reports (IRs), which
are usually confidential, hindering research on LLM-supported patent drafting.
Often, prepublication research papers serve as IRs. We leverage this duality to
build PAP2PAT, an open and realistic benchmark for patent drafting consisting
of 1.8k patent-paper pairs describing the same inventions. To address the
complex long-document patent generation task, we propose chunk-based
outline-guided generation using the research paper as invention specification.
Our extensive evaluation using PAP2PAT and a human case study show that LLMs
can effectively leverage information from the paper, but still struggle to
provide the necessary level of detail. Fine-tuning leads to more patent-style
language, but also to more hallucination. We release our data and code at
https://github.com/boschresearch/Pap2Pat.
♻ ☆ Measuring Human and AI Values Based on Generative Psychometrics with Large Language Models AAAI 2025
Human values and their measurement are a long-standing interdisciplinary
inquiry. Recent advances in AI have sparked renewed interest in this area, with
large language models (LLMs) emerging as both tools and subjects of value
measurement. This work introduces Generative Psychometrics for Values (GPV), an
LLM-based, data-driven value measurement paradigm, theoretically grounded in
text-revealed selective perceptions. The core idea is to dynamically parse
unstructured texts into perceptions akin to static stimuli in traditional
psychometrics, measure the value orientations they reveal, and aggregate the
results. Applying GPV to human-authored blogs, we demonstrate its stability,
validity, and superiority over prior psychological tools. Then, extending GPV
to LLM value measurement, we advance the current art with 1) a psychometric
methodology that measures LLM values based on their scalable and free-form
outputs, enabling context-specific measurement; 2) a comparative analysis of
measurement paradigms, indicating response biases of prior methods; and 3) an
attempt to bridge LLM values and their safety, revealing the predictive power
of different value systems and the impacts of various values on LLM safety.
Through interdisciplinary efforts, we aim to leverage AI for next-generation
psychometrics and psychometrics for value-aligned AI.
comment: Accepted at AAAI 2025
♻ ☆ Autoformalizing Natural Language to First-Order Logic: A Case Study in Logical Fallacy Detection
Translating natural language into formal language such as First-Order Logic
(FOL) is a foundational challenge in NLP with wide-ranging applications in
automated reasoning, misinformation tracking, and knowledge validation. In this
paper, we introduce Natural Language to First-Order Logic (NL2FOL), a framework
to autoformalize natural language to FOL step by step using Large Language
Models (LLMs). Our approach addresses key challenges in this translation
process, including the integration of implicit background knowledge. By
leveraging structured representations generated by NL2FOL, we use
Satisfiability Modulo Theory (SMT) solvers to reason about the logical validity
of natural language statements. We present logical fallacy detection as a case
study to evaluate the efficacy of NL2FOL. Being neurosymbolic, our approach
also provides interpretable insights into the reasoning process and
demonstrates robustness without requiring model fine-tuning or labeled training
data. Our framework achieves strong performance on multiple datasets. On the
LOGIC dataset, NL2FOL achieves an F1-score of 78%, while generalizing
effectively to the LOGICCLIMATE dataset with an F1-score of 80%.
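A toy illustration of the FOL-plus-SMT validity check using Z3, an assumed choice of solver consistent with the SMT step; the formalized syllogism is invented for illustration. Validity is tested by checking that the premises together with the negated conclusion are unsatisfiable.

    # "All politicians lie; a is a politician; therefore a lies."
    from z3 import (BoolSort, Const, DeclareSort, ForAll, Function, Implies,
                    Not, Solver, unsat)

    Entity = DeclareSort("Entity")
    politician = Function("politician", Entity, BoolSort())
    lies = Function("lies", Entity, BoolSort())
    x, a = Const("x", Entity), Const("a", Entity)

    s = Solver()
    s.add(ForAll([x], Implies(politician(x), lies(x))))  # premise 1
    s.add(politician(a))                                 # premise 2
    s.add(Not(lies(a)))                                  # negated conclusion
    print("valid" if s.check() == unsat else "not valid")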
♻ ☆ An LLM-based Agent for Reliable Docker Environment Configuration
Environment configuration is a critical yet time-consuming step in software
development, especially when dealing with unfamiliar code repositories. While
Large Language Models (LLMs) demonstrate the potential to accomplish software
engineering tasks, existing methods for environment configuration often rely on
manual efforts or fragile scripts, leading to inefficiencies and unreliable
outcomes. We introduce Repo2Run, the first LLM-based agent designed to fully
automate environment configuration and generate executable Dockerfiles for
arbitrary Python repositories. We address two major challenges: (1) enabling
the LLM agent to configure environments within isolated Docker containers, and
(2) ensuring the successful configuration process is recorded and accurately
transferred to a Dockerfile without error. To achieve this, we propose atomic
configuration synthesis, featuring a dual-environment architecture (internal
and external environment) with a rollback mechanism to prevent environment
"pollution" from failed commands, guaranteeing atomic execution (execute fully
or not at all) and a Dockerfile generator to transfer successful configuration
steps into runnable Dockerfiles. We evaluate Repo2Run on our proposed benchmark
of 420 recent Python repositories with unit tests, where it achieves an 86.0%
success rate, outperforming the best baseline by 63.9%. Repo2Run is available
at https://github.com/bytedance/Repo2Run.
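A conceptual sketch of atomic configuration synthesis, with `run_in_container`, `snapshot`, and `restore` passed in as hypothetical helpers (e.g., thin wrappers over container commit/restore); each command either succeeds and becomes a Dockerfile step, or fails and is rolled back.

    def atomic_configure(commands, container, run_in_container, snapshot, restore):
        dockerfile = ["FROM python:3.11"]      # assumed base image
        for cmd in commands:
            checkpoint = snapshot(container)   # save state before the attempt
            if run_in_container(container, cmd) == 0:
                dockerfile.append(f"RUN {cmd}")  # record only successful steps
            else:
                restore(container, checkpoint)   # execute fully or not at all
        return "\n".join(dockerfile)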
♻ ☆ Learning to Generate Structured Output with Schema Reinforcement Learning
Yaxi Lu, Haolun Li, Xin Cong, Zhong Zhang, Yesai Wu, Yankai Lin, Zhiyuan Liu, Fangming Liu, Maosong Sun
This study investigates the structured generation capabilities of large
language models (LLMs), focusing on producing valid JSON outputs against a
given schema. Despite the widespread use of JSON in integrating language models
with programs, there is a lack of comprehensive analysis and benchmarking of
these capabilities. We explore various aspects of JSON generation, such as
structure understanding, escaping, and natural language description, to
determine how to assess and enable LLMs to generate valid responses. Building
upon this, we propose SchemaBench, which features around 40K different JSON schemas to
obtain and assess models' abilities in generating valid JSON. We find that the
latest LLMs are still struggling to generate a valid JSON string. Moreover, we
demonstrate that incorporating reinforcement learning with a Fine-grained
Schema Validator can further enhance models' understanding of JSON schema,
leading to improved performance. Our models demonstrate significant improvement
in both generating JSON outputs and downstream tasks.
comment: 8 pages, 4 figures
♻ ☆ Gated Delta Networks: Improving Mamba2 with Delta Rule ICLR 2025
Linear Transformers have gained attention as efficient alternatives to
standard Transformers, but their performance in retrieval and long-context
tasks has been limited. To address these limitations, recent work has explored
two distinct mechanisms: gating for adaptive memory control and the delta
update rule for precise memory modifications. We observe that these mechanisms
are complementary: gating enables rapid memory erasure while the delta rule
facilitates targeted updates. Building on this insight, we introduce the gated
delta rule and develop a parallel training algorithm optimized for modern
hardware. Our proposed architecture, Gated DeltaNet, consistently surpasses
existing models like Mamba2 and DeltaNet across multiple benchmarks, including
language modeling, common-sense reasoning, in-context retrieval, length
extrapolation, and long-context understanding. We further enhance performance
by developing hybrid architectures that combine Gated DeltaNet layers with
sliding window attention or Mamba2 layers, achieving both improved training
efficiency and superior task performance.
comment: ICLR 2025 camera ready
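A single-head sketch of the recurrence as it is commonly written: the gate alpha decays the whole memory, while the delta term replaces the value stored under key k with a beta-weighted update toward v. Shapes and parameterization are simplified assumptions.

    import torch

    def gated_delta_step(S, k, v, alpha, beta):
        """S: (d_v, d_k) memory; k: (d_k,); v: (d_v,); alpha, beta in [0, 1]."""
        erase = torch.outer(S @ k, k)  # value currently stored under key k
        S = alpha * (S - beta * erase) + beta * torch.outer(v, k)
        return S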
♻ ☆ Dual Reasoning: A GNN-LLM Collaborative Framework for Knowledge Graph Question Answering
Large Language Models (LLMs) excel at intuitive, implicit reasoning. Guiding
LLMs to construct thought chains can enhance their deliberate reasoning
abilities, but also faces challenges such as hallucination. Knowledge Graphs
(KGs) can provide explicit structured knowledge for LLMs to alleviate these
issues. However, existing KG-enhanced methods often overlook explicit graph
learning, making it challenging to efficiently provide precise reasoning chains
for LLMs. Following dual-process theory, we propose Dual-Reasoning (DualR), a
novel framework that integrates an external system based on a Graph Neural
Network (GNN) for explicit reasoning on KGs, complementing the implicit
reasoning of LLMs through externalized reasoning chains. DualR designs an
LLM-empowered GNN module for explicit learning on KGs, efficiently extracting
high-quality reasoning chains. These reasoning chains are then refined to a
knowledge-enhanced multiple-choice prompt, guiding a frozen LLM to reason
thoughtfully for final answer determination. Extensive experiments on three
benchmark KGQA datasets demonstrate that DualR achieves state-of-the-art
performance while maintaining high efficiency and interpretability.
♻ ☆ R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs
Recent studies have combined Large Language Models (LLMs) with Knowledge
Graphs (KGs) to enhance reasoning, improving inference accuracy without
additional training while mitigating hallucination. However, existing
frameworks are often rigid, struggling to adapt to KG or task changes. They
also rely heavily on powerful LLMs for reliable (i.e., trustworthy) reasoning.
To address this, we introduce R2-KG, a plug-and-play, dual-agent framework that
separates reasoning into two roles: an Operator (a low-capacity LLM) that
gathers evidence and a Supervisor (a high-capacity LLM) that makes final
judgments. This design is cost-efficient for LLM inference while still
maintaining strong reasoning accuracy. Additionally, R2-KG employs an
Abstention mechanism, generating answers only when sufficient evidence is
collected from KG, which significantly enhances reliability. Experiments across
multiple KG-based reasoning tasks show that R2-KG consistently outperforms
baselines in both accuracy and reliability, regardless of the inherent
capability of LLMs used as the Operator. Further experiments reveal that the
single-agent version of R2-KG, equipped with a strict self-consistency
strategy, achieves significantly higher-than-baseline reliability while
reducing inference cost. However, it also leads to a higher abstention rate in
complex KGs. Our findings establish R2-KG as a flexible and cost-effective
solution for KG-based reasoning. It reduces reliance on high-capacity LLMs
while ensuring trustworthy inference.
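A schematic of the dual-agent loop with abstention; `operator`, `supervisor`, and `kg` are placeholder interfaces invented for illustration, not the paper's API.

    def r2kg_answer(question, operator, supervisor, kg, max_hops=5):
        evidence = []
        for _ in range(max_hops):
            query = operator.next_query(question, evidence)  # low-capacity LLM
            if query is None:
                break
            evidence.extend(kg.lookup(query))
        if supervisor.sufficient(question, evidence):        # high-capacity LLM
            return supervisor.answer(question, evidence)
        return "ABSTAIN"  # answer only when enough KG evidence was collected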
♻ ☆ Markov Chain of Thought for Efficient Mathematical Reasoning NAACL 2025
Multi-step Chain of Thought (CoT) benefits from the logical structure of
the reasoning steps and task-specific actions, significantly enhancing the
mathematical reasoning capabilities of large language models. As long CoT
becomes prevalent, the number of reasoning steps exceeds manageable token
limits and leads to higher computational demands. Inspired by the fundamental logic of
human cognition, "derive, then reduce", we conceptualize the standard
multi-step CoT as a novel Markov Chain of Thought (MCoT). In this study, we
consider the mathematical reasoning task, defining each reasoning step as text
accompanied by a Python code snippet. To facilitate a longer reasoning path,
self-correction is enabled through interactions with the code interpreter. Our
MCoT aims to compress previous reasoning steps into a simplified question,
enabling efficient next-step inference without relying on a lengthy KV cache.
In our experiments, we curate the $\texttt{MCoTInstruct}$ dataset, and the
empirical results indicate that MCoT not only significantly enhances efficiency
but also maintains comparable accuracy. While much remains to be explored, this
work paves the way for exploring the long CoT reasoning abilities of LLMs. The
code is available at https://github.com/james-yw/Markov-Chain-of-Thought
comment: Camera ready version for NAACL 2025 Main
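A schematic of the Markov property at work: each step sees only a compressed, simplified question rather than the full history, so no long KV cache is carried forward. `llm_step` and `compress` are placeholder model calls, not the paper's interface.

    def markov_chain_of_thought(question, llm_step, compress, max_steps=8):
        state = question
        for _ in range(max_steps):
            step = llm_step(state)         # text plus a Python code snippet
            if step.get("final_answer") is not None:
                return step["final_answer"]
            state = compress(state, step)  # Markov property: carry only the
                                           # simplified question forward
        return None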
♻ ☆ Investigating Non-Transitivity in LLM-as-a-Judge
Automatic evaluation methods based on large language models (LLMs) are
emerging as the standard tool for assessing the instruction-following abilities
of LLM-based agents. The most common method in this paradigm, pairwise
comparisons with a baseline model, critically depends on the assumption of
transitive preferences. However, the validity of this assumption remains
largely unexplored. In this study, we investigate the presence of
non-transitivity within the AlpacaEval framework and analyze its effects on
model rankings. We find that LLM judges exhibit non-transitive preferences,
leading to rankings that are sensitive to the choice of the baseline model. To
mitigate this issue, we show that round-robin tournaments combined with
Bradley-Terry models of preference can produce more reliable rankings. Notably,
our method increases both the Spearman correlation and the Kendall correlation
with Chatbot Arena (95.0% -> 96.4% and 82.1% -> 86.3% respectively). To address
the computational cost of round-robin tournaments, we propose Swiss-Wise
Iterative Matchmaking (Swim) tournaments, using a dynamic matching strategy to
capture the benefits of round-robin tournaments while maintaining computational
efficiency.
comment: 8 pages, 6 figures, 2 tables (30 pages, 11 figures, 8 tables
including references and appendices)
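For the ranking step, Bradley-Terry strengths can be fit to round-robin win counts with the classic minorization-maximization update, a standard estimator assumed here rather than taken from the paper.

    import numpy as np

    def bradley_terry(wins, iters=200):
        """wins[i, j]: times model i beat model j (round-robin counts)."""
        n = wins.shape[0]
        pi = np.ones(n)
        for _ in range(iters):
            total_wins = wins.sum(axis=1)
            denom = np.zeros(n)
            for i in range(n):
                for j in range(n):
                    if i != j:
                        games = wins[i, j] + wins[j, i]
                        denom[i] += games / (pi[i] + pi[j])
            pi = total_wins / denom
            pi /= pi.sum()                # fix the arbitrary scale
        return pi                         # rank models by descending pi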
♻ ☆ Training and Evaluating Language Models with Template-based Data Generation
The rapid advancement of large language models (LLMs) such as GPT-3, PaLM,
and Llama has significantly transformed natural language processing, showcasing
remarkable capabilities in understanding and generating language. However,
these models often struggle with tasks requiring complex reasoning,
particularly in mathematical problem-solving, due in part to the scarcity of
large-scale, high-quality, domain-specific datasets necessary for training
sophisticated reasoning abilities. To address this limitation, we introduce
Template-based Data Generation (TDG), a novel approach that leverages LLMs
(GPT-4) to automatically generate parameterized meta-templates, which are then
used to synthesize a vast array of high-quality problems and solutions.
Leveraging TDG, we create TemplateMath Part I: TemplateGSM, a dataset
comprising over 7 million synthetically generated grade school math
problems--each accompanied by code-based and natural language solutions--with
the potential to generate an effectively unlimited number more. This dataset
alleviates the scarcity of large-scale mathematical datasets and serves as a
valuable resource for pre-training, fine-tuning, and evaluating LLMs in
mathematical reasoning. Our method not only enables the generation of virtually
infinite data but also elevates data augmentation to a new level by using GPT-4
for meta-template generation, ensuring diverse and high-quality problem
structures. The TemplateMath Part I: TemplateGSM dataset is publicly available
at https://huggingface.co/datasets/math-ai/TemplateGSM. The code is available
at https://github.com/iiis-ai/TemplateMath.
comment: 9 pages, 2 figures
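A toy sketch of template instantiation with an invented meta-template (in TDG the meta-templates themselves are generated by GPT-4): sampled parameters yield paired problems and solutions, and the supply is effectively unbounded.

    import random

    def instantiate(rng):
        a, b = rng.randint(2, 9), rng.randint(2, 9)
        n = rng.randint(3, 12)
        problem = (f"A box holds {a} red pens and {b} blue pens. "
                   f"How many pens are in {n} such boxes?")
        answer = (a + b) * n
        solution = (f"Each box holds {a} + {b} = {a + b} pens, "
                    f"so {n} boxes hold {a + b} * {n} = {answer} pens.")
        return {"problem": problem, "solution": solution, "answer": answer}

    rng = random.Random(0)
    dataset = [instantiate(rng) for _ in range(1000)]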
♻ ☆ Legal Fact Prediction: The Missing Piece in Legal Judgment Prediction
Junkai Liu, Yujie Tong, Hui Huang, Bowen Zheng, Yiran Hu, Peicheng Wu, Chuan Xiao, Makoto Onizuka, Muyun Yang, Shuyuan Zheng
Legal judgment prediction (LJP), which enables litigants and their lawyers to
forecast judgment outcomes and refine litigation strategies, has emerged as a
crucial legal NLP task. Existing studies typically utilize legal facts, i.e.,
facts that have been established by evidence and determined by the judge, to
predict the judgment. However, legal facts are often difficult to obtain in the
early stages of litigation, significantly limiting the practical applicability
of fact-based LJP. To address this limitation, we propose a novel legal NLP
task: \textit{legal fact prediction} (LFP), which takes the evidence submitted
by litigants for trial as input to predict legal facts, thereby empowering
fact-based LJP technologies to perform prediction in the absence of
ground-truth legal facts. We also propose the first benchmark dataset,
LFPBench, for evaluating the LFP task. Our extensive experiments on LFPBench
demonstrate the effectiveness of LFP-empowered LJP and highlight promising
research directions for LFP. Our code and data are available at
https://github.com/HPRCEST/LFPBench.
♻ ☆ Prompting with Phonemes: Enhancing LLMs' Multilinguality for Non-Latin Script Languages NAACL 2025
Hoang H Nguyen, Khyati Mahajan, Vikas Yadav, Julian Salazar, Philip S. Yu, Masoud Hashemi, Rishabh Maheshwary
Although multilingual LLMs have achieved remarkable performance across
benchmarks, we find they continue to underperform on non-Latin script languages
across contemporary LLM families. This discrepancy arises from the fact that
LLMs are pretrained with orthographic scripts, which are dominated by Latin
characters that obscure their shared phonology with non-Latin scripts. We
propose leveraging phonemic transcriptions as complementary signals to induce
script-invariant representations. Our study demonstrates that integrating
phonemic signals improves performance across both non-Latin and Latin script
languages, with a particularly significant impact on closing the performance
gap between the two. Through detailed experiments, we show that phonemic and
orthographic scripts retrieve distinct examples for in-context learning (ICL).
This motivates our proposed Mixed-ICL retrieval strategy, where aggregating
examples from both leads to significant performance improvements for
both Latin script languages (up to 12.6%) and non-Latin script languages (up to
15.1%) compared to randomized ICL retrieval.
comment: Accepted for NAACL 2025 (Main Conference)
♻ ☆ GENERator: A Long-Context Generative Genomic Foundation Model
Advancements in DNA sequencing technologies have significantly improved our
ability to decode genomic sequences. However, the prediction and interpretation
of these sequences remain challenging due to the intricate nature of genetic
material. Large language models (LLMs) have introduced new opportunities for
biological sequence analysis. Recent developments in genomic language models
have underscored the potential of LLMs in deciphering DNA sequences.
Nonetheless, existing models often face limitations in robustness and
application scope, primarily due to constraints in model structure and training
data scale. To address these limitations, we present GENERator, a generative
genomic foundation model featuring a context length of 98k base pairs (bp) and
1.2B parameters. Trained on an expansive dataset comprising 386B bp of
eukaryotic DNA, the GENERator demonstrates state-of-the-art performance across
both established and newly proposed benchmarks. The model adheres to the
central dogma of molecular biology, accurately generating protein-coding
sequences that translate into proteins structurally analogous to known
families. It also shows significant promise in sequence optimization,
particularly through the prompt-responsive generation of enhancer sequences
with specific activity profiles. These capabilities position the GENERator as a
pivotal tool for genomic research and biotechnological advancement, enhancing
our ability to interpret and predict complex biological systems and enabling
precise genomic interventions. Implementation details and supplementary
resources are available at https://github.com/GenerTeam/GENERator.
♻ ☆ Unraveling and Mitigating Retriever Inconsistencies in Retrieval-Augmented Large Language Models ACL 2024
Although Retrieval-Augmented Large Language Models (RALMs) demonstrate their
superiority in terms of factuality, they do not consistently outperform the
original retrieval-free Language Models (LMs). Our experiments reveal that this
example-level performance inconsistency exists not only between
retrieval-augmented and retrieval-free LM but also among different retrievers.
To understand this phenomenon, we investigate the degeneration behavior of
RALMs and theoretically decompose it into four categories. Further analysis
based on our decomposition reveals that the innate difference in knowledge
sources and the unpredictable degeneration of the reader model contribute most
to the inconsistency. Drawing from our analysis, we introduce Ensemble of
Retrievers (EoR), a trainable framework that can adaptively retrieve from
different knowledge sources and effectively decrease unpredictable reader
errors. Our experiments on Open Domain Question Answering show that EoR
substantially improves performance over the RALM with a single retriever by
considerably reducing inconsistent behaviors.
comment: ACL 2024 (findings)
♻ ☆ MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs
Ved Sirdeshmukh, Kaustubh Deshpande, Johannes Mols, Lifeng Jin, Ed-Yeremai Cardona, Dean Lee, Jeremy Kritz, Willow Primack, Summer Yue, Chen Xing
We present MultiChallenge, a pioneering benchmark evaluating large language
models (LLMs) on conducting multi-turn conversations with human users, a
crucial yet underexamined capability for their applications. MultiChallenge
identifies four categories of challenges in multi-turn conversations that are
not only common and realistic among current human-LLM interactions, but are
also challenging to all current frontier LLMs. All 4 challenges require
accurate instruction-following, context allocation, and in-context reasoning at
the same time. We also develop LLM as judge with instance-level rubrics to
facilitate an automatic evaluation method with fair agreement with experienced
human raters. Despite achieving near-perfect scores on existing multi-turn
evaluation benchmarks, all frontier models have less than 50% accuracy on
MultiChallenge, with the top-performing Claude 3.5 Sonnet (June 2024) achieving
just a 41.4% average accuracy.
♻ ☆ Visual Description Grounding Reduces Hallucinations and Boosts Reasoning in LVLMs ICLR 2025
Sreyan Ghosh, Chandra Kiran Reddy Evuru, Sonal Kumar, Utkarsh Tyagi, Oriol Nieto, Zeyu Jin, Dinesh Manocha
Large Vision-Language Models (LVLMs) often produce responses that misalign
with factual information, a phenomenon known as hallucinations. While
hallucinations are well-studied, the exact causes behind them remain
underexplored. In this paper, we first investigate the root causes of
hallucinations in LVLMs. Our findings reveal that existing mitigation
techniques primarily reduce hallucinations for visual recognition prompts-those
that require simple descriptions of visual elements-but fail for cognitive
prompts that demand deliberate reasoning. We identify the core issue as a lack
of true visual perception in LVLMs: although they can accurately recognize
visual elements, they struggle to fully interpret these elements in the context
of the input prompt and effectively link this recognition to their internal
knowledge, which is critical for reasoning. To address this gap, we introduce
Visual Description Grounded Decoding (VDGD), a simple, robust, and
training-free method designed to enhance visual perception and improve
reasoning capabilities in LVLMs. VDGD works by first generating a detailed
description of the image and appending it as a prefix to the instruction.
During response generation, tokens are sampled based on their KL divergence to
the description, favoring candidates with lower divergence. Experimental
results on multiple visual reasoning benchmarks and LVLMs demonstrate that VDGD
consistently outperforms existing baselines by 2%-33%. Finally, we introduce
VaLLu, a benchmark designed for comprehensive evaluation of the cognitive
capabilities of LVLMs.
comment: Accepted to ICLR 2025. Project: https://sreyan88.github.io/VDGD/
♻ ☆ Unified Mind Model: Reimagining Autonomous Agents in the LLM Era
Large language models (LLMs) have recently demonstrated remarkable
capabilities across domains, tasks, and languages (e.g., ChatGPT and GPT-4),
reviving the research of general autonomous agents with human-like cognitive
abilities. Such human-level agents require semantic comprehension and
instruction-following capabilities, which exactly fall into the strengths of
LLMs. Although there have been several initial attempts to build human-level
agents based on LLMs, the theoretical foundation remains a challenging open
problem. In this paper, we propose a novel theoretical cognitive architecture,
the Unified Mind Model (UMM), which offers guidance to facilitate the rapid
creation of autonomous agents with human-level cognitive abilities.
Specifically, our UMM starts with global workspace theory and further
leverages LLMs to equip the agent with various cognitive abilities, such as
multi-modal perception, planning, reasoning, tool use, learning, memory,
reflection and motivation. Building upon UMM, we then develop an agent-building
engine, MindOS, which allows users to quickly create domain-/task-specific
autonomous agents without any programming effort.
comment: 18 pages
♻ ☆ EchoQA: A Large Collection of Instruction Tuning Data for Echocardiogram Reports NeurIPS
We introduce a novel question-answering (QA) dataset using echocardiogram
reports sourced from the Medical Information Mart for Intensive Care database.
This dataset is specifically designed to enhance QA systems in cardiology,
consisting of 771,244 QA pairs addressing a wide array of cardiac abnormalities
and their severity. We compare large language models (LLMs): open-source and
biomedical-specific models under zero-shot evaluation, and closed-source
models under zero-shot and three-shot evaluation. Our results show
that fine-tuning LLMs improves performance across various QA metrics,
validating the value of our dataset. Clinicians also qualitatively evaluate the
best-performing model to assess the LLM responses for correctness. Further, we
conduct fine-grained fairness audits to assess the bias-performance trade-off
of LLMs across various social determinants of health. Our objective is to
propel the field forward by establishing a benchmark for LLM AI agents aimed at
supporting clinicians with cardiac differential diagnoses, thereby reducing the
documentation burden that contributes to clinician burnout and enabling
healthcare professionals to focus more on patient care.
comment: NeurIPS SafeGenAI 2024
♻ ☆ DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent)
Recent LLM-driven visual agents mainly focus on solving image-based tasks,
which limits their ability to understand dynamic scenes and leaves them far
from real-life applications such as guiding students in laboratory experiments and
identifying their mistakes. Hence, this paper explores DoraemonGPT, a
comprehensive and conceptually elegant system driven by LLMs to understand
dynamic scenes. Considering the video modality better reflects the
ever-changing nature of real-world scenarios, we exemplify DoraemonGPT as a
video agent. Given a video with a question/task, DoraemonGPT begins by
converting the input video into a symbolic memory that stores task-related
attributes. This structured representation allows for spatial-temporal querying
and reasoning by well-designed sub-task tools, resulting in concise
intermediate results. Recognizing that LLMs have limited internal knowledge
when it comes to specialized domains (e.g., analyzing the scientific principles
underlying experiments), we incorporate plug-and-play tools to access external
knowledge and address tasks across different domains. Moreover, a novel
LLM-driven planner based on Monte Carlo Tree Search is introduced to explore
the large planning space for scheduling various tools. The planner iteratively
finds feasible solutions by backpropagating the result's reward, and multiple
solutions can be summarized into an improved final answer. We extensively
evaluate DoraemonGPT's effectiveness on three benchmarks and several
in-the-wild scenarios. The code will be released at
https://github.com/z-x-yang/DoraemonGPT.
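The planner is the most algorithmic piece of the system, so a minimal MCTS
sketch may help: UCB-based selection, expansion, rollout, and reward
backpropagation. Here `propose_actions` and `rollout_reward` are hypothetical
stand-ins for DoraemonGPT's LLM-driven expansion and evaluation steps.

    import math
    import random

    class Node:
        def __init__(self, state, parent=None):
            self.state, self.parent = state, parent
            self.children, self.visits, self.value = [], 0, 0.0

        def ucb(self, c=1.4):
            if self.visits == 0:
                return float("inf")     # always try unvisited children first
            return (self.value / self.visits +
                    c * math.sqrt(math.log(self.parent.visits) / self.visits))

    def search(root_state, propose_actions, rollout_reward, iters=50):
        root = Node(root_state)
        for _ in range(iters):
            node = root
            while node.children:                        # selection
                node = max(node.children, key=Node.ucb)
            for action in propose_actions(node.state):  # expansion
                node.children.append(Node(action, parent=node))
            leaf = random.choice(node.children) if node.children else node
            reward = rollout_reward(leaf.state)         # simulation
            while leaf:                                 # backpropagation
                leaf.visits += 1
                leaf.value += reward
                leaf = leaf.parent
        return max(root.children, key=lambda n: n.visits).state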
♻ ☆ SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration ICLR 2025
Speculative decoding (SD) has emerged as a widely used paradigm to accelerate
LLM inference without compromising quality. It works by first employing a
compact model to draft multiple tokens efficiently and then using the target
LLM to verify them in parallel. While this technique has achieved notable
speedups, most existing approaches necessitate either additional parameters or
extensive training to construct effective draft models, thereby restricting
their applicability across different LLMs and tasks. To address this
limitation, we explore a novel plug-and-play SD solution with layer skipping,
which reuses the target LLM with intermediate layers skipped as its own
compact draft model. Our analysis reveals that LLMs have great potential for
self-acceleration through layer sparsity, and that this sparsity is
task-specific. Building
on these insights, we introduce SWIFT, an on-the-fly self-speculative decoding
algorithm that adaptively selects intermediate layers of LLMs to skip during
inference. SWIFT does not require auxiliary models or additional training,
making it a plug-and-play solution for accelerating LLM inference across
diverse input data streams. Our extensive experiments across a wide range of
models and downstream tasks demonstrate that SWIFT can achieve a 1.3x-1.6x
speedup while preserving the original distribution of the generated text. We
release our code at https://github.com/hemingkx/SWIFT.
comment: ICLR 2025, camera-ready version
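A hedged sketch of one draft-then-verify cycle is below. It assumes a
hypothetical `forward_with_layers(tokens, skip)` hook returning a greedy
next-token id; SWIFT's actual contribution, adaptively choosing which layers
to skip, is elided.

    def speculative_step(tokens, forward_with_layers, skip, draft_len=4):
        # Draft cheaply with intermediate layers skipped.
        draft = list(tokens)
        for _ in range(draft_len):
            draft.append(forward_with_layers(draft, skip=skip))
        # Verify with the full model (no layers skipped); shown sequentially
        # here, though in practice verification is one parallel forward pass.
        accepted = list(tokens)
        for i in range(len(tokens), len(draft)):
            target = forward_with_layers(accepted, skip=())
            if target != draft[i]:
                accepted.append(target)   # first mismatch: correct and stop
                break
            accepted.append(draft[i])     # drafted token confirmed for free
        return accepted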
♻ ☆ Unifying Multitrack Music Arrangement via Reconstruction Fine-Tuning and Efficient Tokenization IJCAI 2025
Automatic music arrangement streamlines the creation of musical variants for
composers and arrangers, reducing reliance on extensive music expertise.
However, existing methods suffer from inefficient tokenization,
underutilization of pre-trained music language models (LMs), and suboptimal
fidelity and coherence in generated arrangements. This paper introduces an
efficient multitrack music tokenizer for unconditional and conditional symbolic
music generation, along with a unified sequence-to-sequence reconstruction
fine-tuning objective for pre-trained music LMs that balances task-specific
needs with coherence constraints. Our approach achieves state-of-the-art
results on band arrangement, piano reduction, and drum arrangement, surpassing
task-specific models in both objective metrics and perceptual quality.
Additionally, we demonstrate that generative pretraining significantly
contributes to the performance across these arrangement tasks, especially when
handling long segments with complex alignment.
comment: Submitted to IJCAI 2025
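As a rough illustration of what reconstruction fine-tuning data could look
like, the sketch below pairs a reduced view of a multitrack piece (source)
with its full token sequence (target); the token layout and the track-keeping
policy are illustrative assumptions, not the paper's scheme.

    import random

    def make_reconstruction_pair(tracks: dict[str, list[str]], keep_ratio=0.5):
        # Keep a random subset of tracks as the conditioning input.
        names = list(tracks)
        kept = random.sample(names, max(1, int(len(names) * keep_ratio)))
        source = [tok for name in kept for tok in tracks[name]]
        target = [tok for name in names for tok in tracks[name]]
        return source, target   # seq2seq pair: partial score -> full arrangement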
♻ ☆ Automatically Labeling Clinical Trial Outcomes: A Large-Scale Benchmark for Drug Development
Background: The cost of drug discovery and development is substantial, with
clinical trial outcomes playing a critical role in regulatory approval and
patient care. However, access to large-scale, high-quality clinical trial
outcome data remains limited, hindering advancements in predictive modeling and
evidence-based decision-making.
Methods: We present the Clinical Trial Outcome (CTO) benchmark, a fully
reproducible, large-scale repository encompassing approximately 125,000 drug
and biologics trials. CTO integrates large language model (LLM) interpretations
of publications, trial phase progression tracking, sentiment analysis from news
sources, stock price movements of trial sponsors, and additional trial-related
metrics. Furthermore, we manually annotated a dataset of clinical trials
conducted between 2020 and 2024 to enhance the quality and reliability of
outcome labels.
Results: The trial outcome labels in the CTO benchmark agree strongly with
expert annotations, achieving an F1 score of 94 for Phase 3 trials and 91
across all phases. Additionally, benchmarking standard machine learning models
on our manually annotated dataset revealed distribution shifts in recent
trials, underscoring the necessity of continuously updated labeling approaches.
Conclusions: By analyzing CTO's performance on recent clinical trials, we
demonstrate the ongoing need for high-quality, up-to-date trial outcome labels.
We publicly release the CTO knowledge base and annotated labels at
https://chufangao.github.io/CTOD, with regular updates to support research on
clinical trial outcomes and inform data-driven improvements in drug
development.
♻ ☆ EgoNormia: Benchmarking Physical Social Norm Understanding
Human activity is moderated by norms. However, machines are often trained
without explicit supervision on norm understanding and reasoning, especially
when the norms are grounded in a physical and social context. To improve and
evaluate the normative reasoning capability of vision-language models (VLMs),
we present EgoNormia $\|\epsilon\|$, consisting of 1,853 ego-centric videos of
human interactions, each of which has two related questions evaluating both the
prediction and justification of normative actions. The normative actions
encompass seven categories: safety, privacy, proxemics, politeness,
cooperation, coordination/proactivity, and communication/legibility. To compile
this dataset at scale, we propose a novel pipeline leveraging video sampling,
automatic answer generation, filtering, and human validation. Our work
demonstrates that current state-of-the-art vision-language models lack robust
norm understanding, scoring at most 45% on EgoNormia (versus a human baseline
of 92%). Our per-dimension analysis highlights significant risks to safety and
privacy, as well as a lack of collaboration and communication capability, when
such models are deployed as real-world agents. We additionally
show that through a retrieval-based generation method, it is possible to use
EgoNormia to enhance normative reasoning in VLMs.
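The retrieval-based generation idea can be sketched simply: embed the current
situation, retrieve the most similar EgoNormia examples, and prepend them to
the VLM prompt. In the sketch, `embed` is a hypothetical unit-normalized
sentence embedder and the prompt template is illustrative.

    import numpy as np

    def retrieve_norm_examples(query, corpus, embed, k=3):
        q = embed(query)                                # (d,), unit norm
        mat = np.stack([embed(doc) for doc in corpus])  # (n, d)
        sims = mat @ q                                  # cosine similarities
        return [corpus[i] for i in np.argsort(-sims)[:k]]

    def build_prompt(query, corpus, embed):
        examples = retrieve_norm_examples(query, corpus, embed)
        context = "\n".join(f"- {e}" for e in examples)
        return f"Relevant norms:\n{context}\n\nSituation: {query}\nBest action?"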
♻ ☆ Adversarial Decoding: Generating Readable Documents for Adversarial Objectives
We design, implement, and evaluate adversarial decoding, a new, generic text
generation technique that produces readable documents for different adversarial
objectives. Prior methods either produce easily detectable gibberish or cannot
handle objectives that include embedding similarity. In particular, they only
work for direct attacks (such as jailbreaking) and cannot produce adversarial
text for realistic indirect injection, e.g., documents that (1) are retrieved
in RAG systems in response to broad classes of queries, and also (2)
adversarially influence subsequent generation. We also show that fluency (low
perplexity) is not sufficient to evade filtering. We measure the effectiveness
of adversarial decoding for different objectives, including RAG poisoning,
jailbreaking, and evasion of defensive filters, and demonstrate that it
outperforms existing methods while producing readable adversarial documents.
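One way to picture such a decoder is a beam search over a joint objective that
trades fluency against embedding similarity to the adversarial target. In this
hedged sketch, `extend`, `lm_logprob`, and `embed_sim` are hypothetical
stand-ins; the paper's actual scorer may differ.

    def adversarial_beam_search(seed, extend, lm_logprob, embed_sim,
                                beam_width=8, steps=20, alpha=0.5):
        beams = [seed]
        for _ in range(steps):
            # Expand every beam, then keep candidates that both read
            # fluently and steer the embedding toward the target.
            candidates = [c for b in beams for c in extend(b)]
            candidates.sort(
                key=lambda c: alpha * lm_logprob(c) + (1 - alpha) * embed_sim(c),
                reverse=True)
            beams = candidates[:beam_width]
        return beams[0]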
♻ ☆ METAL: A Multi-Agent Framework for Chart Generation with Test-Time Scaling
Chart generation aims to produce code that renders charts satisfying the
desired visual properties, e.g., text, layout, color, and chart type. It has
great potential to empower automatic professional report generation in financial
analysis, research presentation, education, and healthcare. In this work, we
build a vision-language model (VLM) based multi-agent framework for effective
automatic chart generation. Generating high-quality charts requires both strong
visual design skills and precise coding capabilities that embed the desired
visual properties into code. Such a complex multi-modal reasoning process is
difficult for direct prompting of VLMs. To resolve these challenges, we propose
METAL, a multi-agent framework that decomposes the task of chart generation
into iterative collaboration among specialized agents. METAL achieves a 5.2%
improvement over the current best result in the chart generation task. The
METAL framework exhibits test-time scaling: its performance increases
monotonically with the logarithm of the computational budget as it grows from 512
to 8192 tokens. In addition, we find that separating different modalities
during the critique process of METAL boosts the self-correction capability of
VLMs in the multimodal context.
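A minimal sketch of the generate-critique loop with modality-separated critics
under a test-time token budget; `generate`, `visual_critic`, and `code_critic`
are hypothetical agent callables, and the token accounting is deliberately
crude.

    def metal_loop(spec, generate, visual_critic, code_critic, budget=8192):
        code, spent = generate(spec, feedback=None), 0
        while spent < budget:
            vis_fb = visual_critic(spec, code)    # critiques the rendered chart
            code_fb = code_critic(spec, code)     # critiques the code itself
            if not vis_fb and not code_fb:
                break                             # both critics satisfied
            code = generate(spec, feedback=(vis_fb, code_fb))
            spent += len(code.split())            # crude token accounting
        return code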
♻ ☆ Towards Fully-Automated Materials Discovery via Large-Scale Synthesis Dataset and Expert-Level LLM-as-a-Judge
Heegyu Kim, Taeyang Jeon, Seungtaek Choi, Ji Hoon Hong, Dong Won Jeon, Sung Beom Cho, Ga-Yeon Baek, Kyung-Won Kwak, Dong-Hee Lee, Sun-Jin Choi, Jisu Bae, Chihoon Lee, Yunseo Kim, Jinsung Park, Hyunsouk Cho
Materials synthesis is vital for innovations such as energy storage,
catalysis, electronics, and biomedical devices. Yet, the process relies heavily
on empirical, trial-and-error methods guided by expert intuition. Our work aims
to support the materials science community by providing a practical,
data-driven resource. We have curated a comprehensive dataset of 17K
expert-verified synthesis recipes from open-access literature, which forms the
basis of our newly developed benchmark, AlchemyBench. AlchemyBench offers an
end-to-end framework that supports research in large language models applied to
synthesis prediction. It encompasses key tasks, including raw materials and
equipment prediction, synthesis procedure generation, and characterization
outcome forecasting. We propose an LLM-as-a-Judge framework that leverages
large language models for automated evaluation, demonstrating strong
statistical agreement with expert assessments. Overall, our contributions offer
a supportive foundation for exploring the capabilities of LLMs in predicting
and guiding materials synthesis, ultimately paving the way for more efficient
experimental design and accelerated innovation in materials science.
comment: under review